In this example, we use the Hitters data set in the package ISLR2. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.
library(ISLR2)
head(Hitters)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson 293 66 1 30 29 14 1 293 66 1
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson 30 29 14 A E 446 33 20
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## Salary NewLeague
## -Andy Allanson NA A
## -Alan Ashby 475.0 N
## -Alvin Davis 480.0 A
## -Andre Dawson 500.0 N
## -Andres Galarraga 91.5 N
## -Alfredo Griffin 750.0 A
anyNA(Hitters) #check if the data has any missing observations
## [1] TRUE
sum(is.na(Hitters$Salary)) #count the number of missing values in the Salary variable
## [1] 59
There are 59 missing observations in the Salary variable. We can use the na.omit() function or the drop_na() in the package tidyr to remove all of the rows that have missing values in any variable.
Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263 20
After removing the missing data, we left with 263 observations in 20 variables.
We will use the caret package to perform ridge regression. In particular, we use the train() function, which can be used to fit ridge regression models, lasso models, and more.
First, we set a random seed so our results will be reproducible, since the choice of the cross-validation folds is random. Then, we split the data into \(75\%\) training data and \(25\%\) test data as follows:
library(caret)
set.seed(1) #to reproduce the results
# Partition (split) data and create index of selected values
index <- sample(1:nrow(Hitters), 0.7*nrow(Hitters))
# Create training and test data
train_df <- Hitters[index,]
test_df <- Hitters[-index,]
Next, we use the following syntax and arguments of S3 method for class ‘formula’ of the train() function to fit a ridge regression:
train(y ~ x, data, preProcess, method = "glmnet", tuneGrid = expand.grid(alpha = 0, lambda), trControl)
,
where
preProcess = c("center", "scale")
.method = "glmnet"
.tuneGrid
is used for an automatically selected range of
\(\lambda\) values. However, we can
implement the function over a grid of values. For example, we may
implement the function over a grid of values ranging from \(\lambda = 10^{10}\) to \(\lambda = 10^{-2}\) by setting
tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(10, -2, length = 100))
.
By default, the train() function has an
alpha argument that determines what
type of model is fit. If alpha = 0 then
a ridge regression model is fit, and if alpha
= 1 then a lasso model is fit.tuneGrid
argument.# Create a vector of potential tuning values
Lambda_values <- 10^seq(10, -2, length = 100)
# Specify 10-fold cross-validation framework to select the best tuning parameter
ctr <- trainControl(method = "cv", number = 10, # 10-fold cross-validation
savePredictions = "all")
# Train the ridge model
fit_ridge <- train(Salary ~ .,
data = train_df,
preProcess = c("center", "scale"),
method = "glmnet",
tuneGrid = expand.grid(alpha = 0,
lambda = Lambda_values),
trControl = ctr)
Unlike least squares model performed with lm(), ridge regression runs the model many times for different tuning values of \(\lambda\). We can find the optimal value of \(\lambda\) that was obtained from the 10-fold cross-validation method by typing the following code:
fit_ridge$bestTune[2] #alternatively type: fit_ridge$bestTune$lambda
## lambda
## 36 174.7528
We can plot the \(\log(\lambda)\) against the root mean squared errors (RMSE) (the square root of the MSE) to see which \(\lambda\) value gives the least value of RMSE.
library(tidyverse)
ggplot(fit_ridge$results, aes(x = log(lambda),
y = RMSE)) +
geom_point(color = "darkgreen", size = 1.5) + geom_line() +
labs(x = expression(log(lambda)),
y = "RMSE") +
theme_bw()
We can also plot the order the predictor variables according to the importance of their contribution to the response variable using the varImp() function
ggplot(varImp(fit_ridge))
As clearly seen from the plot, none of the coefficients of the predictor variables are zero. This confirms that the ridge regression does not perform variable selection!
We can evaluate the accuracy of the ridge model using the RMSE() and R2() functions in the package caret. The RMSE denotes the root mean squared error and the closer the value of the RMSE to zero, the better the model, whereas the R2 denotes the coefficient of determination and the closer the value of the R2 to 1, the more accurate the prediction.
pred1 <- predict(fit_ridge, newdata = test_df)
data.frame(RMSE = RMSE(pred1, test_df$Salary),
Rsquared = R2(pred1, test_df$Salary))
## RMSE Rsquared
## 1 368.6155 0.4118318
In general, the ridge regression is less prone to overfit the training data than the (unpenalized) ordinary least squares (OLS) regression. Therefore, we expect that the ridge regression predict training data less well than OLS, but better predictions to unseen test data. Below we compare the prediction accuracy of ridge regression and OLS on test data. Instead of fitting the OLS using the lm() function, we use the train() function with the argument method = “lm”.
fit_ols <- train(Salary ~ .,
data = train_df,
preProcess = c("center", "scale"),
method = "lm",
trControl = ctr)
# OLS model performance
pred2 <- predict(fit_ols, newdata = test_df)
data.frame(RMSE = RMSE(pred2, test_df$Salary),
Rsquared = R2(pred2, test_df$Salary))
## RMSE Rsquared
## 1 371.1202 0.475941
From the output of the unpenalized linear model (OLS), we see that the RMSE(ridge) is less than the RMSE(OLS) and the Rsquared(ridge) is greater than the Rsquared(OLS). Thus, we conclude that the ridge regression can outperform least squares.
We now ask whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a lasso model, we once again use train() function; however, this time we use the argument alpha = 1. Other than that change, we proceed just as we did in fitting a ridge model.
# Train the lasso model
fit_lasso <- train(Salary ~ .,
data = train_df,
preProcess = c("center", "scale"),
method = "glmnet",
tuneGrid = expand.grid(alpha = 1,
lambda = Lambda_values),
trControl = ctr)
To get the lasso regression model coefficient that has the best \(\lambda\), we use the following syntax:
round(coef(fit_lasso$finalModel, fit_lasso$bestTune$lambda), 3)
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 518.082
## AtBat .
## Hits 35.641
## HmRun .
## Runs .
## RBI .
## Walks 80.633
## Years .
## CAtBat .
## CHits 12.798
## CHmRun .
## CRuns 56.450
## CRBI 128.993
## CWalks .
## LeagueN 9.462
## DivisionW -68.664
## PutOuts 61.428
## Assists .
## Errors .
## NewLeagueN .
Here we see that 11 of the 19 coefficient estimates are exactly zero. So the lasso model with \(\lambda\) chosen by cross-validation contains only 8 predictors. Below the plot of these variables according to the importance of their contribution to the baseball player’s Salary.
ggplot(varImp(fit_lasso))
Finally, we calculate the RMSE and \(R^2\) of the lasso model using the test data and compare them with those of ridge regression. The results suggest that the lasso regression is slightly better than the ridge model.
pred3 <- predict(fit_lasso, newdata = test_df)
data.frame(RMSE = RMSE(pred3, test_df$Salary),
Rsquared = R2(pred3, test_df$Salary))
## RMSE Rsquared
## 1 367.2626 0.4126479
Elastic net model combines the properties of ridge and lasso regression. The model can be easily built using the train() function in the caret package, which automatically selects the optimal value of parameters alpha and lambda. We use the argument tuneLength = 10 , which specifies that 10 different combinations of values for alpha and lambda are to be tested.
# Train the lasso model
fit_elastic.net <- train(Salary ~ .,
data = train_df,
preProcess = c("center", "scale"),
method = "glmnet",
tuneLength = 10,
trControl = ctr)
The best values of \(\alpha\) and \(\lambda\) obtained from the 10-fold cross-validation method are
fit_elastic.net$bestTune
## alpha lambda
## 8 0.1 42.32331
As we did in the ridge and lasso models, we can calculate the RMSE and R-squared values for the elastic net regression model on the test data using the following code:
pred4 <- predict(fit_elastic.net, newdata = test_df)
data.frame(RMSE = RMSE(pred4, test_df$Salary),
Rsquared = R2(pred4, test_df$Salary))
## RMSE Rsquared
## 1 370.3724 0.4174883