Principal components regression (PCR)

Principal components regression (PCR) can be performed using the pcr() function, which is part of the pls package. The syntax for the pcr() function is similar to that of lm(), with a few additional arguments.

  • The argument center = TRUE (the default) mean-centers the predictors, and setting scale = TRUE standardizes each predictor.
  • Setting the argument validation = "CV" tells R to compute the 10-fold cross-validation error for each possible number of principal components. You can specify validation = "LOOCV" instead to perform leave-one-out cross-validation.
  • To decide how many principal components are worth keeping, you can examine the resulting fit with the summary() function, as illustrated in the sketch after this list.
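
For example, the following minimal sketch puts these arguments together on the built-in mtcars data (used here purely for illustration; it is not part of the analysis below):

library(pls)
toy.fit <- pcr(mpg ~ ., data = mtcars,
               center = TRUE,       # the default: mean-center the predictors
               scale = TRUE,        # standardize the predictors
               validation = "CV")   # 10-fold CV; "LOOCV" for leave-one-out
summary(toy.fit)                    # CV error for each number of components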

We use the Hitters data set from the ISLR2 package, and we wish to perform principal components regression in order to predict a baseball player’s Salary. Recall that the Salary variable contains 59 missing observations, so we remove these rows using the na.omit() function. After removing the missing data, we are left with 263 observations on 20 variables (19 predictors and one response).

library(ISLR2)
Hitters <- na.omit(Hitters)
head(Hitters)
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
## -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
## -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
## -Al Newman          185   37     1   23   8    21     2    214    42      1
##                   CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Alan Ashby         321  414    375      N        W     632      43     10
## -Alvin Davis        224  266    263      A        W     880      82     14
## -Andre Dawson       828  838    354      N        E     200      11      3
## -Andres Galarraga    48   46     33      N        E     805      40      4
## -Alfredo Griffin    501  336    194      A        W     282     421     25
## -Al Newman           30    9     24      N        E      76     127      7
##                   Salary NewLeague
## -Alan Ashby        475.0         N
## -Alvin Davis       480.0         A
## -Andre Dawson      500.0         N
## -Andres Galarraga   91.5         N
## -Alfredo Griffin   750.0         A
## -Al Newman          70.0         A

Perform PCR on the full dataset

First, we apply the pcr() function using the full data set as the training data. The cross-validation (CV) score is reported for each possible number of principal components, ranging from \(M=0\) to \(M=19\).

library(pls)
set.seed(2)
pcr.fit1 <- pcr(Salary ~ ., data = Hitters, 
                scale = TRUE, 
                validation = "CV")
summary(pcr.fit1)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV             452    351.9    353.2    355.0    352.8    348.4    343.6
## adjCV          452    351.6    352.7    354.4    352.1    347.6    342.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       345.5    347.7    349.6     351.4     352.1     353.5     358.2
## adjCV    344.7    346.7    348.5     350.1     350.7     352.0     356.5
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        349.7     349.4     339.9     341.6     339.2     339.6
## adjCV     348.0     347.7     338.2     339.7     337.2     337.6
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         38.31    60.16    70.84    79.03    84.29    88.63    92.26    94.96
## Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69    46.75
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         96.28     97.26     97.98     98.65     99.15     99.47     99.75
## Salary    46.86     47.76     47.82     47.85     48.10     50.40     50.55
##         16 comps  17 comps  18 comps  19 comps
## X          99.89     99.97     99.99    100.00
## Salary     53.01     53.85     54.61     54.61

The summary() function reports the percentage of variance explained in the predictors and in the response for each number of components. For example, using only one component, \(M=1\), captures just \(38.31\%\) of the variance in the predictors and \(40.63\%\) of the variance in the response. Using six components, \(M=6\), increases these values to \(88.63\%\) and \(46.48\%\), respectively. Of course, using all 19 components, \(M = p = 19\), explains \(100\%\) of the variance in the predictors; it also raises the percentage of variance explained in the response to \(54.61\%\).
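
The X row of this table can also be reproduced directly: the explvar() function in pls returns the percentage of predictor variance explained by each component, and its cumulative sum recovers the figures above.

cumsum(explvar(pcr.fit1))[1:6]
# should reproduce the X row of the summary: 38.31 60.16 70.84 79.03 84.29 88.63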

Note that the pcr() function reports the root mean squared error of prediction (RMSEP); to obtain the usual MSE, we must square this quantity. For example, the root mean squared error of \(351.9\) for \(M=1\) corresponds to an MSE of \(351.9^{2} = 123,833.6\).
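
Rather than squaring by hand, the MSEP() function in pls reports the cross-validated MSE directly. A quick check against the RMSEP value above:

RMSEP(pcr.fit1)$val["CV", 1, "1 comps"]^2  # 351.9^2 = 123833.6
MSEP(pcr.fit1)$val["CV", 1, "1 comps"]     # the same quantity, reported as MSE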

We can use the validationplot() function to plot the CV scores and identify the number of principal components that minimizes the cross-validation error:

validationplot(pcr.fit1, val.type = "MSEP", legendpos = "top")

From the plot, we see that the cross-validation error is smallest when \(M = 18\) components are used (CV = 339.2), but it is barely lower than the error with \(M = 6\), so a model with only six components may be sufficient.
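
We can also locate the minimum programmatically rather than reading it off the plot. The sketch below extracts the CV curve from pcr.fit1 and, as an alternative, applies the one-standard-error rule implemented in selectNcomp(), which formalizes the preference for a more parsimonious model:

cv.msep <- MSEP(pcr.fit1)$val["CV", 1, ]
which.min(cv.msep) - 1   # subtract 1: the first entry is the intercept-only model
selectNcomp(pcr.fit1, method = "onesigma")  # smallest M within one SE of the minimum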

Perform PCR on the training data and evaluate its test set performance

We now perform PCR on the training data and evaluate its test set performance. First, we set a random seed so that our results are reproducible, since the choice of the cross-validation folds is random. Then we split the data into \(50\%\) training data and \(50\%\) test data as follows:

set.seed(1) # to reproduce the results
# Partition (split) the data: create an index of randomly selected rows
index <- sample(1:nrow(Hitters), 0.5 * nrow(Hitters)) 
# Create training and test data
train_df <- Hitters[index,]
test_df <- Hitters[-index,]
set.seed(1)
pcr.fit2 <- pcr(Salary ~ ., data = Hitters, 
                subset = index,
                scale = TRUE, 
                validation = "CV")
validationplot(pcr.fit2, val.type = "MSEP")

From the MSEP plot above, we find that the lowest cross-validation error appears to occur when around \(M = 5\) or \(M = 6\) components are used. Thus, we fit a regression using only 5 principal components and evaluate its performance by computing the MSE on the test data as follows:

pcr.fit3 <- pcr(Salary ~ ., data = train_df, 
                scale = TRUE, 
                ncomp = 5)
summary(pcr.fit3)
## Data:    X dimension: 131 19 
##  Y dimension: 131 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps
## X         39.32    61.57    71.96    80.83    85.95
## Salary    43.87    43.93    47.36    47.37    49.52

Finally, we compute the test MSE:

pcr.pred3 <- predict(pcr.fit3, test_df, ncomp = 5)
mean((pcr.pred3 - test_df$Salary)^2)
## [1] 142811.8

Recall that a smaller mean squared error (MSE) indicates a better model, but on its own we cannot tell whether \(MSE = 142811.8\) is small or large. We therefore need at least one additional model against which to compare MSE values.
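
For a rough sense of scale, one simple reference (a sketch, not part of the original comparison) is the intercept-only baseline that predicts every test salary by the training-set mean:

mean((mean(train_df$Salary) - test_df$Salary)^2)  # baseline test MSE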

To assess whether adding one more component enhances the performance of this model, we compute the MSE for the regression model using 6 principal components (PCs):

pcr.fit4 <- pcr(Salary ~ ., data = train_df, 
                scale = TRUE, 
                ncomp = 6)
pcr.pred4 <- predict(pcr.fit4, test_df, ncomp = 6)
mean((pcr.pred4 - test_df$Salary)^2)
## [1] 141882.9

The test MSE associated with the 6-PC regression is \(141882.9\), which is smaller than the one obtained with 5 PCs. Thus, we may conclude that a PCR model with only six components is a reasonable approach for predicting a baseball player’s Salary from the Hitters data.
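
For convenience, the two test MSEs can be collected side by side using the prediction objects created above:

data.frame(model    = c("PCR, M = 5", "PCR, M = 6"),
           test_MSE = c(mean((pcr.pred3 - test_df$Salary)^2),   # 142811.8
                        mean((pcr.pred4 - test_df$Salary)^2)))  # 141882.9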