Ridge Regression and the Lasso

In this example, we use the Hitters data set in the package ISLR2. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.

library(ISLR2)
head(Hitters)
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson      293   66     1   30  29    14     1    293    66      1
## -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
## -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
## -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
##                   CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson       30   29     14      A        E     446      33     20
## -Alan Ashby         321  414    375      N        W     632      43     10
## -Alvin Davis        224  266    263      A        W     880      82     14
## -Andre Dawson       828  838    354      N        E     200      11      3
## -Andres Galarraga    48   46     33      N        E     805      40      4
## -Alfredo Griffin    501  336    194      A        W     282     421     25
##                   Salary NewLeague
## -Andy Allanson        NA         A
## -Alan Ashby        475.0         N
## -Alvin Davis       480.0         A
## -Andre Dawson      500.0         N
## -Andres Galarraga   91.5         N
## -Alfredo Griffin   750.0         A
anyNA(Hitters) #check if the data has any missing observations 
## [1] TRUE
sum(is.na(Hitters$Salary)) #count the number of missing values in the Salary variable
## [1] 59

There are 59 missing observations in the Salary variable. We can use the na.omit() function or the drop_na() function in the tidyr package to remove all of the rows that have missing values in any variable.

Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263  20

After removing the missing data, we are left with 263 observations on 20 variables.
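
Equivalently, the same cleaning can be done with drop_na() from the tidyr package mentioned above; a minimal sketch, assuming tidyr is installed (the name Hitters_complete is used here only for illustration):

library(tidyr)
Hitters_complete <- drop_na(Hitters)  # removes every row containing an NA in any column
dim(Hitters_complete)                 # 263 rows and 20 columns, matching the na.omit() result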

Ridge regression

We will use the caret package to perform ridge regression. In particular, we use the train() function, which can be used to fit ridge regression models, lasso models, and more.
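
Recall that ridge regression chooses the coefficient estimates that minimize \(\text{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2\), where \(\lambda \ge 0\) is a tuning parameter: \(\lambda = 0\) gives the ordinary least squares fit, and larger values of \(\lambda\) shrink the coefficients more strongly toward zero.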

First, we set a random seed so that our results are reproducible, since both the training/test split and the cross-validation folds are chosen at random. Then, we split the data into \(70\%\) training data and \(30\%\) test data as follows:

library(caret)
set.seed(1) #to reproduce the results
# Partition (split) data and create index of selected values
index <- sample(1:nrow(Hitters), 0.7*nrow(Hitters)) 
# Create training and test data
train_df <- Hitters[index,]
test_df <- Hitters[-index,]

Next, we use the following syntax and arguments of the formula method of the train() function to fit a ridge regression model:

train(y ~ x, data, preProcess, method = "glmnet", tuneGrid = expand.grid(alpha = 0, lambda), trControl), where

  • y: Response variable.
  • x: Predictor variables.
  • data: Data frame.
  • preProcess: We may standardize the variables so that they are on the same scale, which is recommended for penalized regression because the penalty depends on the scale of the predictors. In this case, use preProcess = c("center", "scale").
  • method: The glmnet() function in the glmnet package is used to fit ridge, lasso, and elastic net models, among others. In this case, we set method = "glmnet".
  • tuneGrid: A data frame of candidate tuning values. If tuneGrid is not supplied, train() chooses a set of \(\lambda\) values automatically; otherwise, we can supply our own grid. For example, we may search over values ranging from \(\lambda = 10^{10}\) to \(\lambda = 10^{-2}\) by setting tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(10, -2, length = 100)). The alpha value in the grid determines which type of model is fit: if alpha = 0 a ridge regression model is fit, and if alpha = 1 a lasso model is fit.
  • trControl: A list of values that define how model training is controlled. For example, we may specify a K-fold cross-validation framework to train and evaluate the model and to select the optimal tuning parameter from the vector of \(\lambda\) values supplied in the tuneGrid argument.

# Create a vector of potential tuning values
Lambda_values <- 10^seq(10, -2, length = 100)
# Specify 10-fold cross-validation framework to select the best tuning parameter 
ctr <- trainControl(method = "cv", number = 10, # 10-fold cross-validation
                    savePredictions = "all")
# Train the ridge model
fit_ridge <- train(Salary ~ ., 
             data = train_df, 
             preProcess = c("center", "scale"), 
             method = "glmnet", 
             tuneGrid = expand.grid(alpha = 0, 
                                    lambda = Lambda_values), 
             trControl = ctr)

The optimal tuning parameter lambda

Unlike a least squares model fit with lm(), ridge regression fits the model many times, once for each candidate value of the tuning parameter \(\lambda\). We can find the optimal value of \(\lambda\) selected by the 10-fold cross-validation by typing the following code:

fit_ridge$bestTune[2] #alternatively type: fit_ridge$bestTune$lambda
##      lambda
## 36 174.7528

We can plot \(\log(\lambda)\) against the root mean squared error (RMSE, the square root of the MSE) to see which value of \(\lambda\) yields the smallest RMSE.

library(tidyverse)
ggplot(fit_ridge$results, aes(x = log(lambda), 
                          y = RMSE)) + 
  geom_point(color = "darkgreen", size = 1.5) + geom_line() +
  labs(x = expression(log(lambda)),
       y = "RMSE") +
  theme_bw()

Variable importance

We can also rank the predictor variables according to the importance of their contribution to the response variable and plot them using the varImp() function:

ggplot(varImp(fit_ridge))

As clearly seen from the plot, none of the coefficients of the predictor variables is exactly zero. This confirms that ridge regression does not perform variable selection!
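
We can confirm this directly by extracting the ridge coefficients at the optimal \(\lambda\), using the same coef() syntax applied to the lasso model later in this section (output not shown; every coefficient is shrunk toward zero, but none is exactly zero):

# Ridge coefficients at the cross-validated lambda: shrunken, but all nonzero
round(coef(fit_ridge$finalModel, fit_ridge$bestTune$lambda), 3)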

Model prediction

We can evaluate the accuracy of the ridge model on the test data using the RMSE() and R2() functions in the caret package. RMSE denotes the root mean squared error: the closer it is to zero, the better the model. R2 denotes the coefficient of determination: the closer it is to 1, the more accurate the prediction.
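
For reference, with \(n\) test observations, \(\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\), where \(y_i\) is the observed Salary and \(\hat{y}_i\) is its prediction.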

pred1 <- predict(fit_ridge, newdata = test_df)
data.frame(RMSE = RMSE(pred1, test_df$Salary),
           Rsquared = R2(pred1, test_df$Salary))
##       RMSE  Rsquared
## 1 368.6155 0.4118318

Ridge vs least squares regression

In general, ridge regression is less prone to overfitting the training data than (unpenalized) ordinary least squares (OLS) regression. Therefore, we expect ridge regression to fit the training data less well than OLS but to give better predictions on unseen test data. Below we compare the prediction accuracy of ridge regression and OLS on the test data. Instead of fitting the OLS model with the lm() function, we use the train() function with the argument method = "lm".

fit_ols <- train(Salary ~ ., 
             data = train_df, 
             preProcess = c("center", "scale"), 
             method = "lm",
             trControl = ctr)
# OLS model performance 
pred2 <- predict(fit_ols, newdata = test_df)
data.frame(RMSE = RMSE(pred2, test_df$Salary),
           Rsquared = R2(pred2, test_df$Salary))
##       RMSE Rsquared
## 1 371.1202 0.475941

From the output of the unpenalized linear model (OLS), we see that the test RMSE of the ridge model (368.6) is lower than that of OLS (371.1), although its R-squared is somewhat lower as well. In terms of RMSE, then, ridge regression can outperform least squares.
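
As a sanity check on the overfitting argument above, we could also compare the two models on the training data; a minimal sketch reusing the fitted objects (output not shown, but we would typically expect OLS to achieve the lower training RMSE):

# Training-set RMSE: OLS is expected to fit the training data more closely than ridge
data.frame(model = c("ridge", "OLS"),
           train_RMSE = c(RMSE(predict(fit_ridge, newdata = train_df), train_df$Salary),
                          RMSE(predict(fit_ols, newdata = train_df), train_df$Salary)))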

The Lasso

We now ask whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a lasso model, we once again use the train() function; however, this time we set alpha = 1 in tuneGrid. Other than that change, we proceed just as we did in fitting the ridge model.
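
Recall that the lasso replaces the ridge penalty with \(\lambda\sum_{j=1}^{p}|\beta_j|\); this \(\ell_1\) penalty can force some coefficient estimates to be exactly zero, so the lasso also performs variable selection.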

# Train the lasso model
fit_lasso <- train(Salary ~ ., 
             data = train_df, 
             preProcess = c("center", "scale"), 
             method = "glmnet", 
             tuneGrid = expand.grid(alpha = 1, 
                                    lambda = Lambda_values), 
             trControl = ctr)

To get the lasso regression coefficients at the best value of \(\lambda\), we use the following syntax:

round(coef(fit_lasso$finalModel, fit_lasso$bestTune$lambda), 3)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                  s1
## (Intercept) 518.082
## AtBat         .    
## Hits         35.641
## HmRun         .    
## Runs          .    
## RBI           .    
## Walks        80.633
## Years         .    
## CAtBat        .    
## CHits        12.798
## CHmRun        .    
## CRuns        56.450
## CRBI        128.993
## CWalks        .    
## LeagueN       9.462
## DivisionW   -68.664
## PutOuts      61.428
## Assists       .    
## Errors        .    
## NewLeagueN    .

Here we see that 11 of the 19 coefficient estimates are exactly zero. So the lasso model with \(\lambda\) chosen by cross-validation contains only 8 predictors. Below is a plot of these variables ranked according to the importance of their contribution to the baseball player's Salary.

ggplot(varImp(fit_lasso))
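
If we prefer to count the selected predictors programmatically rather than by reading the printout, a short sketch (lasso_coef is just a temporary name for the sparse coefficient matrix returned by coef()):

# Count how many predictors (excluding the intercept) have nonzero lasso coefficients
lasso_coef <- coef(fit_lasso$finalModel, fit_lasso$bestTune$lambda)
sum(lasso_coef[-1, 1] != 0)  # 8 predictors survive the lasso penalty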

Finally, we calculate the RMSE and \(R^2\) of the lasso model on the test data and compare them with those of the ridge regression. The results below suggest that the lasso model is slightly better than the ridge model.

pred3 <- predict(fit_lasso, newdata = test_df)
data.frame(RMSE = RMSE(pred3, test_df$Salary),
           Rsquared = R2(pred3, test_df$Salary))
##       RMSE  Rsquared
## 1 367.2626 0.4126479

Elastic net regression

The elastic net model combines the penalties of ridge and lasso regression. The model can be easily built using the train() function in the caret package, which selects the optimal values of the parameters alpha and lambda by cross-validation. We use the argument tuneLength = 10, which tells train() to evaluate a default grid of 10 candidate values each for alpha and lambda.
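
In glmnet's parameterization, the elastic net penalty is \(\lambda\left[\tfrac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2 + \alpha\sum_{j=1}^{p}|\beta_j|\right]\), so \(\alpha = 0\) recovers the ridge penalty, \(\alpha = 1\) recovers the lasso penalty, and intermediate values blend the two.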

# Train the elastic net model
fit_elastic.net <- train(Salary ~ ., 
                         data = train_df, 
                         preProcess = c("center", "scale"), 
                         method = "glmnet", 
                         tuneLength = 10, 
                         trControl = ctr)

The best values of \(\alpha\) and \(\lambda\) obtained from the 10-fold cross-validation are:

fit_elastic.net$bestTune 
##   alpha   lambda
## 8   0.1 42.32331
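
As with the lasso, we can inspect the coefficients at these selected tuning values using the same coef() syntax as before (output not shown; with \(\alpha = 0.1\) the penalty is much closer to ridge than to the lasso):

# Elastic net coefficients at the cross-validated alpha and lambda
round(coef(fit_elastic.net$finalModel, fit_elastic.net$bestTune$lambda), 3)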

As we did for the ridge and lasso models, we can calculate the RMSE and R-squared values of the elastic net regression model on the test data using the following code:

pred4 <- predict(fit_elastic.net, newdata = test_df)
data.frame(RMSE = RMSE(pred4, test_df$Salary),
           Rsquared = R2(pred4, test_df$Salary))
##       RMSE  Rsquared
## 1 370.3724 0.4174883
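
To summarize, the test-set results of the four models can be gathered into a single table for easier comparison; a small sketch reusing the predictions computed above:

# Collect the test-set accuracy of all four models in one data frame
data.frame(model    = c("ridge", "OLS", "lasso", "elastic net"),
           RMSE     = c(RMSE(pred1, test_df$Salary), RMSE(pred2, test_df$Salary),
                        RMSE(pred3, test_df$Salary), RMSE(pred4, test_df$Salary)),
           Rsquared = c(R2(pred1, test_df$Salary), R2(pred2, test_df$Salary),
                        R2(pred3, test_df$Salary), R2(pred4, test_df$Salary)))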