Predicting End-of-Season OBP after Mar/Apr

Introduction / Objective

The goal of this project is to predict a Major League Baseball (MLB) hitter’s end-of-season On-Base Percentage (OBP) from early-season results, specifically the stats accumulated through the first two months (March/April) of the season.  This information could be useful to MLB teams in a variety of ways, including assessing player performance, progress, and value in order to make decisions about rosters, potential trades, pickups, lineup options, etc.  
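
For reference, OBP measures how often a hitter reaches base, and is defined as

OBP = (H + BB + HBP) / (AB + BB + HBP + SF),

where H is hits, BB is walks, HBP is hit-by-pitches, AB is at-bats, and SF is sacrifice flies.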

This project was originally completed as part of my Advanced Sports Analytics class and has been adapted for this post.  

Data

The data for this project was sourced from two datasets:

  1. A dataset provided by our instructor which includes statistics of MLB hitters from March and April of the 2019 season.  
  2. Batting statistics from the 2015-2018 seasons sourced from FanGraphs using the baseballr R package.  

Though a variety of stats were included in the dataset and explored, the primary statistics utilized in the analysis were Walk Percentage (BB%), Batting Average (AVG), Line Drive Percentage (LD%), the ratio of Home Runs to Fly Balls (HR/FB), Strikeout Percentage (K%), Batting Average on Balls In Play (BABIP), and Fly Ball Percentage (FB%).  
    
# load required packages
library(tidyverse)   # dplyr, ggplot2, readr, stringr, tidyr, purrr
library(baseballr)   # FanGraphs data access
library(corrplot)    # correlation matrix plot
library(gridExtra)   # grid.arrange for plot grids
library(car)         # vif()
library(caret)       # model training (the rf/gbm/ridge methods below also need the randomForest, gbm, and elasticnet packages)
library(yardstick)   # mae / rmse / rsq metrics

# load historical hitter data (2015-2018; qual = 200 sets a plate-appearance minimum)
stats_fangraphs <- fg_batter_leaders(x = 2015, y = 2018, league = 'all', qual = 200, ind = 1, exc_p = TRUE)
stats_fangraphs_agg <- fg_batter_leaders(x = 2015, y = 2018, qual = 200, ind = 0, exc_p = TRUE)

# load Mar/Apr 2019 batting data (the rest of the code refers to this data frame as `batting`)
batting <- read_csv('/Desktop/batting.csv')
    
Data Cleaning

A few steps were required to clean and format the dataset prior to the analysis:

  1. Column names were simplified for easier reference in later code.  
  2. Values stored in an “XX%” format were converted to numeric variables. 
  3. The HR/FB variable was converted to a proportion.  
    
# check for NAs
colSums(is.na(batting))

# simplify column names
colnames(batting) <- sub("^MarApr_", "", colnames(batting))
colnames(batting) <- sub('%', '_perc', colnames(batting))

# remove % signs and convert columns to numeric
for (name in colnames(batting)) {
  if (grepl('_perc', name)) { 
          # if '_perc' is in the name, update the values in the column
          batting <- batting %>%
              mutate(!!name := as.numeric(str_replace_all(!!sym(name), '%', '')))
  } 
}

batting <- batting %>%
  mutate(`HR/FB` = as.numeric(str_replace_all(`HR/FB`, '%', '')) / 100)
    

Exploration and Visualization

We’ll begin by generating some plots to explore the data and get a feel for the important characteristics of the variables/hitting stats we’ll be working with.  Specifically, we’ll look for signs of collinearity between variables, examine each variable’s distribution, and check each variable’s relationship with OBP.  

Collinearity

There are a number of variables that appear to be collinear with each other. We’ll aim to avoid using these in conjunction when building predictive models. (A quick programmatic check for such pairs is sketched after the list below.)

  • PA/AB/H/HR/R/RBI
  • H/AVG/OBP/SLG
  • BB% and K% are inversely related to O-Swing% and Contact%, respectively
  • BABIP/OBP/AVG/SLG
  • GB% and FB% are inversely related to each other
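
These pairs can also be flagged programmatically. A minimal sketch, assuming the `correlations` matrix computed in the exploration code below and an (arbitrary) threshold of |r| > 0.8:

# list variable pairs whose absolute correlation exceeds the threshold
high_cor <- which(abs(correlations) > 0.8 & upper.tri(correlations), arr.ind = TRUE)
data.frame(var1 = rownames(correlations)[high_cor[, 1]],
           var2 = colnames(correlations)[high_cor[, 2]],
           r = correlations[high_cor])
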
Distributions

Many of the variables appear to be normally distributed.  There are some exceptions, like Home Runs (HR) and Stolen Bases (SB), which skew right, but for the most part the distributions look reasonable for modeling.  

Relationship to OBP - Scatter Plots

Skimming through these plots, there do appear to be relationships between some of the stats and OBP, such as AVG and SLG.  This gives some guidance on which variables to start with when building a model.  

    
#calculate + visualize correlations of numeric variables:
cor_df <- batting[, 4:29] %>% na.omit()
correlations <- round(cor(cor_df), 2)
#correlations
corrplot(correlations)


# generate histograms for each variable
plot_list <- list()
names <- colnames(batting[, 4:29])
for (name in names){
  if (is.numeric(batting[[name]])) {
    plot <- batting %>% ggplot(aes(x = .data[[name]])) + 
      geom_histogram() +
      labs(title = name) +
      theme_minimal()
  } else {
    plot <- batting %>% 
      dplyr::select(.data[[name]]) %>% 
      na.omit() %>%
      ggplot(aes(x = .data[[name]])) + 
      geom_bar(stat = "count") +
      labs(title = name) +
      theme_minimal()
  }
  plot_list <- c(plot_list, list(plot))
}

grid.arrange(grobs = plot_list[1:9], ncol = 3)
grid.arrange(grobs = plot_list[10:18], ncol = 3)
grid.arrange(grobs = plot_list[19:27], ncol = 3)


# create scatter plots of each variable against OBP
plot_list <- list()
for (name in names){
  plot <- batting[, 4:29] %>% 
    ggplot(aes(x = .data[[name]], y = OBP)) +
    geom_point() +
    geom_smooth() +
    labs(title = paste("OBP vs. ", name)) +
    theme_minimal()
  plot_list <- c(plot_list, list(plot))
}

grid.arrange(grobs = plot_list[1:9], ncol = 3)
grid.arrange(grobs = plot_list[10:18], ncol = 3)
grid.arrange(grobs = plot_list[19:27], ncol = 3)
    

Modeling OBP

Simple Regression / Predictor Selection

We’ll begin with a simple linear regression model to get a sense of which variables correlate well with OBP.  This will help inform which predictors we use later on when we employ additional modeling techniques.  

We’ll start with the formula below and trim it down based on VIF and Adjusted R-Squared.  Since we observed some notable collinearity from our previous exploration, reducing the VIF will be important here.  

OBP ~ H + BB% + K% + ISO + BABIP + AVG + SLG + LD% + GB% + FB% + IFFB% + HR/FB + Z-Swing% + Swing% + Contact%
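
A minimal sketch of one way to automate the trimming (prune_by_vif is a hypothetical helper, not from any package, and the threshold of 5 is just a common rule of thumb):

# iteratively refit the model, dropping the highest-VIF predictor,
# until all VIFs fall below the threshold
prune_by_vif <- function(data, response, predictors, threshold = 5) {
  repeat {
    fit <- lm(reformulate(predictors, response), data = data)
    vifs <- car::vif(fit)
    if (max(vifs) < threshold || length(predictors) <= 2) break
    predictors <- predictors[-which.max(vifs)]  # drop the worst offender
  }
  reformulate(predictors, response)
}

# e.g. prune_by_vif(lm_df, "OBP", c("BB_perc", "AVG", "LD_perc", "`HR/FB`"))

Adjusted R-squared should still be checked at each step, since dropping predictors on VIF alone can cost explanatory power.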

This brings us to two options:

  1. OBP ~ BB% + AVG + LD% + HR/FB
  2. OBP ~ BB% + K% + BABIP + FB% + HR/FB

The first set of predictors performs slightly better with a linear model, but since it won’t cost too much, we’ll check both throughout.  

    
# collect data into new df for linear model
lm_df <- batting[, 2:28] %>% na.omit()

# define/adjust formulas
form1 <- OBP ~ BB_perc + AVG + LD_perc + `HR/FB`
form2 <- OBP ~ BB_perc + K_perc + BABIP + FB_perc + `HR/FB` 

# check simple linear regression, summary, and vif
fit_lm <- function(data, formula){
  lm_model <- lm(formula, data)
  lm_summary <- summary(lm_model)
  vif_values <- vif(lm_model)
  return(list(summary = lm_summary, vif = vif_values))
}
model1 <- fit_lm(lm_df, form1)
model1$summary
model1$vif
model2 <- fit_lm(lm_df, form2)
model2$summary
        
    
Option 1:

Residuals:
      Min        1Q    Median        3Q       Max
-0.019549 -0.007916 -0.001846  0.005723  0.044131

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  0.0356135  0.0033496   10.632    <2e-16 ***
BB_perc      0.0074491  0.0001480   50.318    <2e-16 ***
AVG          0.8610646  0.0123612   69.659    <2e-16 ***
LD_perc      0.0002346  0.0001073    2.186    0.0296 *
`HR/FB`      0.0139324  0.0067911    2.052    0.0410 *

Residual standard error: 0.01096 on 315 degrees of freedom
Multiple R-squared: 0.9661, Adjusted R-squared: 0.9657
F-statistic: 2247 on 4 and 315 DF, p-value: < 2.2e-16

VIF:
 BB_perc      AVG  LD_perc  `HR/FB`
1.051944 1.207986 1.150687 1.116361

Option 2:

Residuals:
      Min        1Q    Median        3Q       Max
-0.054625 -0.007276 -0.000794  0.005360  0.045671

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  1.122e-01   5.122e-03   21.906    <2e-16 ***
BB_perc      6.615e-03   1.698e-04   38.969    <2e-16 ***
K_perc      -3.475e-03   9.741e-05  -35.674    <2e-16 ***
BABIP        6.078e-01   1.044e-02   58.200    <2e-16 ***
FB_perc      7.548e-04   8.183e-05    9.224    <2e-16 ***
`HR/FB`      1.670e-01   7.659e-03   21.800    <2e-16 ***

Residual standard error: 0.01254 on 314 degrees of freedom
Multiple R-squared: 0.9558, Adjusted R-squared: 0.9551
F-statistic: 1359 on 5 and 314 DF, p-value: < 2.2e-16

VIF:
 BB_perc   K_perc    BABIP  FB_perc  `HR/FB`
1.057566 1.070236 1.086499 1.093337 1.085459

Train Models to Predict OBP

We’ll train four different models:

  1. Standard Multiple Linear Regression – Our simple regression from above yielded a fairly high adjusted R-squared of about 0.96, and linear models are highly interpretable, so it should be useful.  We’ll use this with both sets of predictors above.  
  2. Random Forests – These models are generally versatile, flexible enough to capture non-linear relationships, and relatively resilient to outliers, so they seem like a good option here.  We’ll use this with both sets of predictors above.  
  3. Ridge Regression – Since ridge regression applies regularization that shrinks coefficients and dampens the impact of weaker or collinear predictors, we’ll use all of the variables when building this model in case it captures something major that we missed in our simple regression analysis above.  
  4. Gradient Boosting – We’ll use this with both sets of predictors from above.  
The batter data from March/April 2019 will be split into a training set (75%) and a test set (25%), and we’ll use repeated (3x) ten-fold cross-validation to fit the models on the training data.  
    
# select data for models
reg_df <- batting[, 4:29]

# split data into train/test (75/25)
set.seed(123) 
train_idx <- sample(nrow(reg_df), round(0.75 * nrow(reg_df)))
train <- reg_df[train_idx, ]
test <- reg_df[-train_idx, ]


# set cv protocol
control <- trainControl(method = "repeatedcv", #repeated cv for better accuracy metrics
                        number = 10,
                        repeats = 3,
                        savePredictions = TRUE)

# Train a linear regression model
fit_lm1 <- train(form1,
                method = "lm", data = train,
                trControl = control)

fit_lm2 <- train(form2,
                method = "lm", data = train,
                trControl = control)

# Train a random forest model
fit_rf1 <- train(form1,
                method = "rf", data = train,
                trControl = control)
fit_rf2 <- train(form2,
                method = "rf", data = train,
                trControl = control)

# Train a ridge model
fit_ridge <- train(OBP~H + BB_perc + K_perc + ISO + BABIP + AVG + SLG + LD_perc + GB_perc + FB_perc + IFFB_perc + `HR/FB` + `Z-Swing_perc` + Swing_perc + Contact_perc, 
                   method = 'ridge', data = train, 
                   trControl = control, 
                   preProcess = c("center", "scale"),
                   tuneGrid = expand.grid(lambda = seq(0, 1, by = 0.1)))

# Train a gbm
fit_gbm1 <- train(form1,
                method = "gbm", data = train,
                trControl = control)
fit_gbm2 <- train(form2,
                method = "gbm", data = train,
                trControl = control)

        
    

Evaluating the Models

Metrics: We’ll use Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-Squared (rsq) as metrics to evaluate our models; these are the three values yardstick’s metrics() returns in the code below. 
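
For reference, these reduce to simple formulas. A minimal hand-rolled sketch (suffixed _manual to avoid masking yardstick’s own functions; yardstick’s rsq is the squared correlation between truth and estimate):

# hand-rolled equivalents of the yardstick metrics used below
mae_manual  <- function(truth, estimate) mean(abs(truth - estimate))
rmse_manual <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
rsq_manual  <- function(truth, estimate) cor(truth, estimate)^2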

Results: All four model types performed well, but the random forest on the first set of predictors (fit_rf1) performed best on all metrics (MAE, RMSE, rsq). Its test MAE of about 0.006 suggests that it can estimate a player’s current OBP within about 6 points on average, which is pretty good. 

However, it is important to note that, so far, we’ve only demonstrated that we can predict current OBP from current stats.  This isn’t the same as predicting end-of-season OBP.  To make predictions about the end of the season, we’ll have to approximate what our predictors will look like and apply the model to those approximations. This will likely reduce the accuracy of the final model, so we’ll have to evaluate further.  

 

Metrics on All Models (Train & Test Sets):
model train/test mae rmse rsq
gbm1 test 0.00880 0.0117 0.888
gbm1 train 0.00797 0.0106 0.971
gbm2 test 0.00960 0.0125 0.875
gbm2 train 0.00830 0.0111 0.968
lm1 test 0.00814 0.0117 0.890
lm1 train 0.00797 0.0104 0.971
lm2 test 0.0102 0.0133 0.860
lm2 train 0.00884 0.0124 0.960
rf1 test 0.00644 0.00932 0.933
rf1 train 0.00519 0.00711 0.988
rf2 test 0.00887 0.0114 0.926
rf2 train 0.00715 0.00964 0.983
ridge test 0.00848 0.0114 0.895
ridge train 0.00757 0.00988 0.974
    
# Function to compute metrics
compute_metrics <- function(model, data, model_name) {
  # Add predictions to data frame
  data <- data %>%
    mutate(prediction = predict(model, data))

  # Compute metrics and add model name
  metrics(data, truth = OBP, estimate = prediction) %>%
    mutate(model = model_name)
}

# Compute metrics for all models
models <- list(fit_lm1, fit_lm2, fit_rf1, fit_rf2, fit_ridge, fit_gbm1, fit_gbm2)
model_names <- c("lm1 train", "lm2 train", "rf1 train", "rf2 train", 
                 'ridge train', 'gbm1 train', 'gbm2 train')

metrics_df <- purrr::map2_df(models, model_names, compute_metrics, data = train)

# View the results
metrics_df

# add predictions of model back into training set
train$`Linear regression1` <- predict(fit_lm1, train)
train$`Linear regression2` <- predict(fit_lm2, train)
train$`Random forest1` <- predict(fit_rf1, train)
train$`Random forest2` <- predict(fit_rf2, train)
train$`Ridge` <- predict(fit_ridge, train)
train$`Gradient Boost1` <- predict(fit_gbm1, train)
train$`Gradient Boost2` <- predict(fit_gbm2, train)

# Evaluate the performance on training set
train_metrics <- bind_rows(
  metrics(train, truth = OBP, estimate = `Linear regression1`) %>% mutate(model="lm1 train"),
  metrics(train, truth = OBP, estimate = `Linear regression2`) %>% mutate(model="lm2 train"),
  metrics(train, truth = OBP, estimate = `Random forest1`) %>% mutate(model = "rf1 train"),
  metrics(train, truth = OBP, estimate = `Random forest2`) %>% mutate(model = "rf2 train"),
  metrics(train, truth = OBP, estimate = `Ridge`) %>% mutate(model = 'ridge train'),
  metrics(train, truth = OBP, estimate = `Gradient Boost1`) %>% mutate(model = 'gbm1 train'),
  metrics(train, truth = OBP, estimate = `Gradient Boost2`) %>% mutate(model = 'gbm2 train')
)

# Create the new columns
test$`Linear regression1` <- predict(fit_lm1, test)
test$`Linear regression2` <- predict(fit_lm2, test)
test$`Random forest1` <- predict(fit_rf1, test)
test$`Random forest2` <- predict(fit_rf2, test)
test$`Ridge` <- predict(fit_ridge, test)
test$`Gradient Boost1` <- predict(fit_gbm1, test)
test$`Gradient Boost2` <- predict(fit_gbm2, test)


# Evaluate the performance on test set
test_metrics <- bind_rows(
  metrics(test, truth = OBP, estimate = `Linear regression1`) %>% mutate(model = "lm1 test"),
  metrics(test, truth = OBP, estimate = `Linear regression2`) %>% mutate(model = "lm2 test"),
  metrics(test, truth = OBP, estimate = `Random forest1`) %>% mutate(model = "rf1 test"),
  metrics(test, truth = OBP, estimate = `Random forest2`) %>% mutate(model = "rf2 test"),
  metrics(test, truth = OBP, estimate = `Ridge`) %>% mutate(model = 'ridge test'),
  metrics(test, truth = OBP, estimate = `Gradient Boost1`) %>% mutate(model = 'gbm1 test'), 
  metrics(test, truth = OBP, estimate = `Gradient Boost2`) %>% mutate(model = 'gbm2 test')
)

train_metrics

# Combine and format the results
result_df <- bind_rows(train_metrics, test_metrics) %>%
  dplyr::select(-.estimator) %>%
  arrange(.metric, model) %>%
  spread(.metric, .estimate)

result_df
    

Predicting End-of-Season OBP

In order to make end-of-season predictions, we’ll assume some amount of regression to the mean for each of the predictors (and possibly overall OBP too). However, instead of regressing toward the league mean of each stat alone, we’ll also take the player’s historical performance into account. To get a sense of how much a player’s history influences their current performance, we’ll compare 2015-2017 stats to 2018 stats for each of the predictors (and OBP). Fitting a simple linear regression for each predictor, we’ll use the r-squared value as a rough proportion of how much to weight the player’s historical performance relative to the overall league mean. A higher r-squared means past performance in that stat correlates better with future performance, so this weighting gives more weight to past performance exactly where it is more predictive.
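
In formula terms, the value each stat is regressed to (the _RT columns in the code that follows) is

STAT_RT = R-Squared * Player_Historical_STAT + (1 – R-Squared) * League_Mean_STAT.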

Determine Proportion of 2018 Variance Explained by 2015-17

Linear Regression of Predictor Variables

This analysis yields the following R-Squared values for each of the variables and, by extension, the weight we’ll give each player’s historical statistics.  

Predictor   R-Squared
OBP         0.22
AVG         0.21
BB%         0.47
LD%         0.22
HR/FB       0.22
K%          0.61
BABIP       0.13
FB%         0.49
    
# prep historical data for analysis
historical_data <- as.data.frame(stats_fangraphs) %>% 
  dplyr::select(playerid, Season, Name, OBP, PA, AVG, BB_pct, LD_pct, HR_FB, K_pct, BABIP, FB_pct)
hist17 <- historical_data %>% filter(Season <= 2017) 
hist18 <- historical_data %>% filter(Season == 2018)

colnames(hist18)[3:12] <- paste0(colnames(hist18)[3:12], "_18")
colnames(hist17)[3:12] <- paste0(colnames(hist17)[3:12], "_17")
historical_data <- merge(hist17, hist18, by = "playerid", all = TRUE) %>% na.omit()

# regress each 2018 stat on its 2015-17 counterpart
lm_obp <- lm(OBP_18 ~ OBP_17, historical_data)
lm_avg <- lm(AVG_18 ~ AVG_17, historical_data)
lm_bbp <- lm(BB_pct_18 ~ BB_pct_17, historical_data)
lm_ldp <- lm(LD_pct_18 ~ LD_pct_17, historical_data)
lm_hrfb <- lm(HR_FB_18 ~ HR_FB_17, historical_data)
lm_kp <- lm(K_pct_18 ~ K_pct_17, historical_data)
lm_babip <- lm(BABIP_18 ~ BABIP_17, historical_data)
lm_fbp <- lm(FB_pct_18 ~ FB_pct_17, historical_data)

# plot the graphs with r^2 and p values displayed
plot_obp <- historical_data %>% 
  ggplot(aes(x = OBP_18, y = OBP_17)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(title = 'OBP') +
  geom_text(x = .25, y = .42, size = 2,
            label = paste("R^2 =", format(summary(lm_obp)$r.squared, digits = 3), "\n",
                          "p =", format(summary(lm_obp)$coefficients[2, 4], digits = 3))) + 
  theme_minimal()

plot_avg <- historical_data %>% 
  ggplot(aes(x = AVG_18, y = AVG_17)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(title = 'AVG') +
  geom_text(x = .18, y = .32, size = 2,
            label = paste("R^2 =", format(summary(lm_avg)$r.squared, digits = 3), "\n",
                          "p =", format(summary(lm_avg)$coefficients[2, 4], digits = 3))) +
  theme_minimal()


plot_bbp <- historical_data %>% 
  ggplot(aes(x = BB_pct_18, y = BB_pct_17)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(title = 'BB%') +
  geom_text(x = 3, y = 15, size = 2,
            label = paste("R^2 =", format(summary(lm_bbp)$r.squared, digits = 3), "\n",
                          "p =", format(summary(lm_bbp)$coefficients[2, 4], digits = 3))) +
  theme_minimal()


plot_LDP <- historical_data %>% 
  ggplot(aes(x = LD_pct_18, y = LD_pct_17)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(title = 'LD%') +
  geom_text(x = 15, y = 28, size = 2,
            label = paste("R^2 =", format(summary(lm_ldp)$r.squared, digits = 3), "\n",
                          "p =", format(summary(lm_ldp)$coefficients[2, 4], digits = 3))) +
  theme_minimal()


plot_hrfb <- historical_data %>% 
  ggplot(aes(x = HR_FB_18, y = HR_FB_17)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(title = 'HR/FB') +
  geom_text(x = 4, y = 32, size = 2, 
            label = paste("R^2 =", format(summary(lm_hrfb)$r.squared, digits = 3), "\n",
                          "p =", format(summary(lm_hrfb)$coefficients[2, 4], digits = 3))) +
  theme_minimal()

plot_kp <- historical_data %>% 
  ggplot(aes(x = K_pct_18, y = K_pct_17)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(title = 'K%') +
  geom_text(x = 10, y = 32, size = 2,
            label = paste("R^2 =", format(summary(lm_kp)$r.squared, digits = 3), "\n",
                          "p =", format(summary(lm_kp)$coefficients[2, 4], digits = 3))) + 
  theme_minimal()

plot_babip <- historical_data %>% 
  ggplot(aes(x = BABIP_18, y = BABIP_17)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(title = 'BABIP') +
  geom_text(x = .2, y = .35, size = 2,
            label = paste("R^2 =", format(summary(lm_babip)$r.squared, digits = 3), "\n",
                          "p =", format(summary(lm_babip)$coefficients[2, 4], digits = 3))) + 
  theme_minimal()

plot_fbp <- historical_data %>% 
  ggplot(aes(x = FB_pct_18, y = FB_pct_17)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  labs(title = 'FB%') +
  geom_text(x = 20, y = 50, size = 2,
            label = paste("R^2 =", format(summary(lm_fbp)$r.squared, digits = 3), "\n",
                          "p =", format(summary(lm_fbp)$coefficients[2, 4], digits = 3))) + 
  theme_minimal()

grid.arrange(plot_obp, plot_avg, plot_bbp, plot_LDP, plot_hrfb, plot_kp, 
             plot_babip, plot_fbp, ncol = 3) 
    

Determining What To Regress Each Statistic To

Now, we’ll determine what “mean” to regress to by taking a weighted average of the league mean and each individual player’s historical performance in each stat, where the weighting is based on the R-Squared value from the linear models above.  The following table includes those values and their corresponding equations.  

Predictor   League Average   Weighted Mean to Regress To
OBP         0.3156614        0.28 * OBP + (1 - 0.28) * 0.3213672
AVG         0.2487154        0.20 * AVG + (1 - 0.20) * 0.2511231
BB%         8.1192369        0.51 * BB% + (1 - 0.51) * 8.5211663
LD%         20.7917329       0.29 * LD% + (1 - 0.29) * 20.8712743
HR/FB       12.0421304       0.29 * HR/FB + (1 - 0.29) * 12.9485961
K%          22.0227345       0.61 * K% + (1 - 0.61) * 22.0227345
BABIP       0.2977250        0.13 * BABIP + (1 - 0.13) * 0.2977250
FB%         34.7594595       0.49 * FB% + (1 - 0.49) * 34.7594595

This will give us a “mean” to regress to for each stat that incorporates an “appropriate” proportion of the player’s historical performance. Note that for any players in the 2019 season without historical data, we’ll just use the league mean as a substitute.

    
# determine the league means for each stat across the full period
historical_data_agg <- as.data.frame(stats_fangraphs_agg)[, c(1:51, 289)] %>% 
  dplyr::select(playerid, Name, OBP, PA, AVG, BB_pct, LD_pct, HR_FB, K_pct, BABIP, FB_pct)
stats <- c('OBP', 'AVG', 'BB%', 'LD%', 'HR/FB', 'K%', 'BABIP', 'FB%', 'PA')
means <- c(mean(historical_data_agg$OBP), 
           mean(historical_data_agg$AVG),
           mean(historical_data_agg$BB_pct),
           mean(historical_data_agg$LD_pct),
           mean(historical_data_agg$HR_FB), 
           mean(historical_data_agg$K_pct),
           mean(historical_data_agg$BABIP),
           mean(historical_data_agg$FB_pct),
           mean(historical_data_agg$PA)) 
data.frame(stats, means)

# save r^2 values (3 significant digits) for the weighted averages
r_obp <- signif(summary(lm_obp)$r.squared, 3)
r_avg <- signif(summary(lm_avg)$r.squared, 3)
r_bbp <- signif(summary(lm_bbp)$r.squared, 3)
r_ldp <- signif(summary(lm_ldp)$r.squared, 3)
r_hrfb <- signif(summary(lm_hrfb)$r.squared, 3)
r_kp <- signif(summary(lm_kp)$r.squared, 3)
r_babip <- signif(summary(lm_babip)$r.squared, 3)
r_fbp <- signif(summary(lm_fbp)$r.squared, 3)


# compute weighted averages to regress to
regress_to_data <- historical_data_agg %>%
  mutate(OBP_RT = r_obp * OBP + (1-r_obp) * mean(historical_data_agg$OBP), 
         AVG_RT = r_avg * AVG + (1-r_avg) * mean(historical_data_agg$AVG), 
         BB_pct_RT = r_bbp * BB_pct + (1-r_bbp) * mean(historical_data_agg$BB_pct), 
         LD_pct_RT = r_ldp * LD_pct + (1-r_ldp) * mean(historical_data_agg$LD_pct), 
         HR_FB_RT = r_hrfb * HR_FB + (1-r_hrfb) * mean(historical_data_agg$HR_FB), 
         K_pct_RT = r_kp * K_pct + (1 - r_kp) * mean(historical_data_agg$K_pct), 
         BABIP_RT = r_babip * BABIP + (1 - r_babip) * mean(historical_data_agg$BABIP), 
         FB_pct_RT = r_fbp * FB_pct + (1 - r_fbp) * mean(historical_data_agg$FB_pct)) %>% 
  dplyr::select(playerid, Name, OBP_RT, AVG_RT, BB_pct_RT, LD_pct_RT, HR_FB_RT, K_pct_RT, 
                BABIP_RT, FB_pct_RT)

head(regress_to_data)
    

Regression to the Weighted-Mean

To regress to the weighted mean, we’ll take another weighted average, this time using Plate Appearances (PA) as our sample size to determine how much weight to give the Mar/Apr 2019 stats vs. the weighted-mean value computed above. To do this, we’ll use the formula

Regressed STAT = Current_STAT * (PA / N) + STAT_RT * (1 – PA / N),

where N is determined empirically so that the regressed stat’s standard deviation and distribution match those of the historical values.
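
For example, with N = 180, a hitter with 120 plate appearances through April keeps a weight of 120/180 ≈ 0.67 on their current stat, with the remaining 0.33 on the regress-to value; a hitter with only 60 PA is pulled much harder toward the regress-to value (60/180 ≈ 0.33 on the current stat).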

To determine N, we’ll test a series of values from 50 to 400 and select the one that brings the standard deviation of the regressed stat closest to the standard deviation of the historical data. We’ll also examine the distributions graphically and adjust if necessary so the regressed and historical distributions look similar. Here are the values of N determined for each stat:

Statistic   N
OBP         180
AVG         180
BB%         160
LD%         200
HR/FB       400
K%          130
BABIP       160
FB%         130

Using these values for N gives distributions that are reasonably consistent with the historical distributions of these stats, indicating the weights are appropriate.  

Code to Determine N

The code for BB% is included as a sample here; the other statistics were determined via the same method.  I know I should have written a loop or function to condense the code for the repeated operation across all statistics, but I didn’t, so I won’t copy it all here.  (A condensed sketch of such a helper follows the example below.)  

    
# organize data
bat_df <- merge(batting, regress_to_data, by = "playerid", all.x = TRUE) %>%
  dplyr::select(playerid, Name.x, PA, OBP, BB_perc, AVG, LD_perc, `HR/FB`, 
                K_perc, BABIP, FB_perc, FullSeason_OBP, 
                OBP_RT, BB_pct_RT, AVG_RT, LD_pct_RT, HR_FB_RT, K_pct_RT, BABIP_RT, FB_pct_RT) %>%
  mutate(OBP_RT = ifelse(is.na(OBP_RT), mean(historical_data_agg$OBP), OBP_RT), 
         BB_pct_RT = ifelse(is.na(BB_pct_RT), mean(historical_data_agg$BB_pct), BB_pct_RT),
         AVG_RT = ifelse(is.na(AVG_RT), mean(historical_data_agg$AVG), AVG_RT), 
         LD_pct_RT = ifelse(is.na(LD_pct_RT), mean(historical_data_agg$LD_pct), LD_pct_RT),
         HR_FB_RT = ifelse(is.na(HR_FB_RT), mean(historical_data_agg$HR_FB), HR_FB_RT), 
         K_pct_RT = ifelse(is.na(K_pct_RT), mean(historical_data_agg$K_pct), K_pct_RT),
         BABIP_RT = ifelse(is.na(BABIP_RT), mean(historical_data_agg$BABIP), BABIP_RT),
         FB_pct_RT = ifelse(is.na(FB_pct_RT), mean(historical_data_agg$FB_pct), FB_pct_RT))
         
# determine optimal PA weighting for BB_perc
sample <- seq(50, 400, 10)
sds <- numeric()
hist_sd <- sd(historical_data_agg$BB_pct)
for (n in sample) {
  reg <- bat_df$PA / n * bat_df$BB_perc +
         (1 - bat_df$PA / n) * bat_df$BB_pct_RT
  sds <- c(sds, abs(sd(reg) - hist_sd))
}

plot_df <- data.frame(sample, sds)
plot_df %>% ggplot(aes(x = sample, y = sds)) + 
  geom_line() + 
  labs(title = 'BB_perc') +
  theme_minimal()

plot_df %>% filter(sds == min(sds))

# compute the regressed stats using the empirically determined N for each
bat_df <- bat_df %>%
  mutate(obp_reg = PA / 180 * OBP + (1 - PA / 180) * OBP_RT,
         avg_reg = PA / 170 * AVG + (1 - PA / 170) * AVG_RT, 
         bbp_reg = PA / 160 * BB_perc + (1 - PA / 160) * BB_pct_RT,
         ldp_reg = PA / 200 * LD_perc + (1 - PA / 200) * LD_pct_RT,
         hrfb_reg = PA / 400 * `HR/FB` + (1 - PA / 400) * HR_FB_RT, 
         kp_reg = PA / 110 * K_perc + (1 - PA / 110) * K_pct_RT, 
         babip_reg = PA / 160 * BABIP + (1 - PA / 160) * BABIP_RT, 
         fbp_reg = PA / 120 * FB_perc + (1 - PA / 120) * FB_pct_RT)
         
# plot densities for comparison
dens1 <- bat_df %>% ggplot(aes(x = bbp_reg)) +
  geom_density() +
  labs(title = 'BB% Regressed Distribution') +
  scale_x_continuous(limits = c(0, 25)) +
  theme_minimal()

dens2 <- historical_data_agg %>% ggplot(aes(x = BB_pct)) +
  geom_density() +
  labs(title = 'BB% Historical Distribution') +
  scale_x_continuous(limits = c(0, 25)) +
  theme_minimal()

grid.arrange(dens1, dens2, ncol = 1)
    

This yields an N of 180 and a difference in standard deviation (between our regressed distribution and the historical one) of 0.01997226.  
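
As noted above, the per-stat search can be condensed into a helper function. A minimal sketch (find_n is a hypothetical helper following the same logic as the BB% example, not part of any package):

# search candidate values of N and return the one whose regressed
# distribution has the standard deviation closest to the historical one
find_n <- function(current, regress_to, pa, hist_values, candidates = seq(50, 400, 10)) {
  hist_sd <- sd(hist_values, na.rm = TRUE)
  sd_diff <- sapply(candidates, function(n) {
    reg <- pa / n * current + (1 - pa / n) * regress_to
    abs(sd(reg, na.rm = TRUE) - hist_sd)
  })
  candidates[which.min(sd_diff)]
}

# e.g. for BB%:
# find_n(bat_df$BB_perc, bat_df$BB_pct_RT, bat_df$PA, historical_data_agg$BB_pct)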

Combining OBP Model with Regressed Statistics to Predict End-of-Season OBP

Now that we have our regressed statistics for each of the predictors, we can apply the OBP models built above to those in order to estimate the end-of-season OBP.

This yields an MAE of about 0.023 for our best-performing random forest model. This is OK, but 23 points of OBP is certainly not as good as the 6 we saw on the test set of the early-season data, so we’re losing a lot as we project toward the end of the season. We can achieve a very modest improvement by regressing this prediction toward the weighted mean for OBP, similar to the other statistics; however, this only brings us to an MAE of about 0.0225.

    
# organize data so we can predict EOS OBP
eos_df <- merge(batting, bat_df[, c(1, 21:28)], by = "playerid", all.x = TRUE, suffixes = c("", ".x")) %>%
  dplyr::select(-OBP, -AVG, -BB_perc, -LD_perc, -`HR/FB`, -K_perc, -BABIP, -FB_perc)
# rename the regressed columns so the trained models can use them as predictors
eos_df <- eos_df %>%
  dplyr::rename(OBP = obp_reg, AVG = avg_reg, BB_perc = bbp_reg, LD_perc = ldp_reg,
                `HR/FB` = hrfb_reg, K_perc = kp_reg, BABIP = babip_reg, FB_perc = fbp_reg)

# predict EOS OBP using each model
eos_df$eos_obp_lm1 <- predict(fit_lm1, newdata = eos_df)
eos_df$eos_obp_lm2 <- predict(fit_lm2, newdata = eos_df)
eos_df$eos_obp_rf1 <- predict(fit_rf1, newdata = eos_df)
eos_df$eos_obp_rf2 <- predict(fit_rf2, newdata = eos_df)
eos_df$eos_obp_ridge <- predict(fit_ridge, newdata = eos_df)
eos_df$eos_obp_gbm1 <- predict(fit_gbm1, newdata = eos_df)
eos_df$eos_obp_gbm2 <- predict(fit_gbm2, newdata = eos_df)

# plot prediction vs. actual value for each model
# (assumes a color palette vector, e.g. mako_palette <- viridisLite::mako(100))
eos_df %>% ggplot(aes(x = FullSeason_OBP)) + 
  geom_point(aes(y = eos_obp_rf1, color = mako_palette[75])) +
  geom_point(aes(y = eos_obp_rf2, color = mako_palette[50])) +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  labs(title = 'Predicted Value vs. Actual Value', 
       subtitle = 'For End-of-Season OBP') +
  scale_y_continuous(name = 'Predicted OBP') +
  scale_x_continuous(name = 'Actual 2019 End-of-Season OBP') +
  scale_color_manual(values = c(mako_palette[75], mako_palette[50]),
                     labels = c("Random Forest 2", "Random Forest 1")) +
  theme_minimal() +
  guides(color = guide_legend(title = "Model")) +
  theme(legend.position = "right") 
  
# residuals plot
eos_df <- eos_df %>% mutate(rf_residuals1 = eos_obp_rf1 - FullSeason_OBP, 
                            rf_residuals2 = eos_obp_rf2 - FullSeason_OBP)
eos_df %>% ggplot(aes(x = FullSeason_OBP)) +
  geom_point(aes(y = rf_residuals1, color = mako_palette[75])) +
  geom_point(aes(y = rf_residuals2, color = mako_palette[50])) +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  scale_color_manual(values = c(mako_palette[75], mako_palette[50]),
                     labels = c("Random Forest 2", "Random Forest 1")) +
  labs(title = 'Residuals of RF Predictions') +
  theme_minimal() +
  guides(color = guide_legend(title = "Model")) +
  theme(legend.position = "right")

# compute MAE, RMSE metrics
MAE1 <- mean(abs(eos_df$rf_residuals1))
RMSE1 <- sqrt(mean((eos_df$rf_residuals1)^2))
MAE2 <- mean(abs(eos_df$rf_residuals2))
RMSE2 <- sqrt(mean((eos_df$rf_residuals2)^2))
metrics <- c('MAE1', 'RMSE1', 'MAE2', 'RMSE2')
values <- c(MAE1, RMSE1, MAE2, RMSE2)
data.frame(metrics, values)

# weighted OBP prediction: average the regressed OBP with the rf1 prediction
eos_df$weighted_prediction <- eos_df$OBP * .5 + eos_df$eos_obp_rf1 * (1 - .5)
eos_df$weighted_residuals <- eos_df$weighted_prediction - eos_df$FullSeason_OBP

# compute MAE, RMSE metrics on the weighted prediction
MAE <- mean(abs(eos_df$weighted_residuals))
RMSE <- sqrt(mean((eos_df$weighted_residuals)^2))
metrics <- c('MAE', 'RMSE')
values <- c(MAE, RMSE)
data.frame(metrics, values)
    

Final Results / Conclusions

With OBP Weighting:

Metric   Value
MAE      0.02251379
RMSE     0.02859316

Without OBP Weighting:

Metric           Value
MAE (Model 1)    0.02300048
RMSE (Model 1)   0.02928569
MAE (Model 2)    0.03557778
RMSE (Model 2)   0.04386678

With a combination of machine learning techniques and regression to the mean, we were able to build a model that projects a player’s end-of-season On-Base Percentage from a selection of batting statistics from the early months (Mar/Apr) of a season.  The model predicts with an average error of about .022 (22 points of OBP).  This is an OK result, though I was hoping for a bit better considering how accurate the initial OBP models looked.  However, the model tends to over-project players who finished on the lower end of OBP and under-project players who finished on the higher end, so depending on the context, it may still prove useful.  
