Pruned Tree: The pruned decision tree generated by the J48 algorithm, trained on team-average statistics for games from the last five seasons.
The best accuracy we achieved was 63.9%, using simple logistic regression trained on the dataset containing the average statistics across the players on the home and away teams. This accuracy is similar to that of other amateur attempts. We found that ERA was by far the best predictor: when it was left out, accuracy dropped by more than 7%, and when we used only ERA, accuracy was still 62.1%. When using each individual player’s stats, a model that included only on-base percentage (OBP) and pitching achieved about half a percent higher accuracy than one using all the statistics or any one of batting average, OPS, or slugging percentage. This suggests that while fans love hits, what matters most is simply getting on base, so walks are just fine.
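The leave-one-out and single-feature comparisons described above can be organized as a small ablation loop. The following is a minimal sketch, not the procedure actually used for these results: `ablation_report` and `toy_score` are hypothetical names, and `toy_score` is a stand-in with arbitrary made-up weights so the example runs on its own. In practice `score_fn` would train and evaluate a real classifier on the chosen feature subset.

```python
from typing import Callable, Dict, Sequence, Tuple


def ablation_report(
    features: Sequence[str],
    score_fn: Callable[[Tuple[str, ...]], float],
) -> Tuple[float, Dict[str, Dict[str, float]]]:
    """Score the full feature set, each feature left out, and each feature alone."""
    full = score_fn(tuple(features))
    report: Dict[str, Dict[str, float]] = {}
    for feature in features:
        without = tuple(f for f in features if f != feature)
        report[feature] = {
            # How much accuracy is lost when this feature is removed.
            "drop_when_removed": full - score_fn(without),
            # Accuracy when this feature is the only input.
            "alone": score_fn((feature,)),
        }
    return full, report


# ASSUMPTION: placeholder scorer with invented weights, NOT the reported results.
# Replace with a function that trains and evaluates a model on `features`.
TOY_WEIGHTS = {"era": 0.08, "obp": 0.03, "fielding_pct": 0.01}


def toy_score(features: Tuple[str, ...]) -> float:
    return 0.5 + sum(TOY_WEIGHTS[f] for f in features)


FEATURES = ["era", "obp", "fielding_pct"]
full, report = ablation_report(FEATURES, toy_score)

# Rank features by how much accuracy drops when each one is left out.
for name in sorted(report, key=lambda f: report[f]["drop_when_removed"], reverse=True):
    row = report[name]
    print(f"{name}: drop={row['drop_when_removed']:.3f}, alone={row['alone']:.3f}")
```

A feature that causes a large drop when removed and still scores well on its own, as ERA did here, is carrying most of the predictive signal.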
Interestingly, a model using only salary totals for both teams achieves 54.5% accuracy, just one percentage point above predicting the home team to win every game. This bodes well for all the underdog fans out there. A model that considered only fielding percentage had an accuracy of around 54%, which, at only half a percentage point above the home-team guess, is not very significant. It seems that while fielding is part of baseball, it does not play a significant role in deciding the winner.
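To make these comparisons concrete, the reported accuracies can be expressed as percentage points above the always-pick-the-home-team baseline. This is a small sketch under stated assumptions: the baseline of 53.5% is not stated directly but is implied by the figures above (54.5% minus one point, and roughly 54% minus half a point), the fielding figure is approximate, and the function and variable names are illustrative.

```python
def points_over_baseline(model_acc_pct: float, baseline_pct: float) -> float:
    """Return the accuracy gain over the baseline, in percentage points."""
    return round(model_acc_pct - baseline_pct, 1)


# ASSUMPTION: home-team baseline implied by the text (54.5% is one point above it).
HOME_BASELINE_PCT = 53.5

# Accuracies as reported in the text; the fielding figure is "around 54%".
reported_accuracy_pct = {
    "simple logistic regression (team averages)": 63.9,
    "ERA only": 62.1,
    "salary totals only": 54.5,
    "fielding percentage only": 54.0,
}

for model, acc in reported_accuracy_pct.items():
    gain = points_over_baseline(acc, HOME_BASELINE_PCT)
    print(f"{model}: {acc:.1f}% ({gain:+.1f} points over home-team baseline)")
```

Reporting gains against the baseline rather than raw accuracy shows how little salary and fielding add compared with ERA.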