Principal components regression (PCR) can be performed using the pcr() function, which is part of the pls package. The syntax for pcr() is similar to that of lm(), with a few additional arguments. The argument center = TRUE performs mean centering, and setting scale = TRUE standardizes the predictor variables. Setting validation = "CV" tells R to compute the 10-fold cross-validation error for each possible number of principal components used; you can specify validation = "LOOCV" instead to perform leave-one-out cross-validation. The resulting fit can then be inspected with summary().
We use the Hitters data set from the ISLR2 package, and we wish to perform principal components regression in order to predict a baseball player’s Salary. Recall that there are 59 missing observations in the Salary variable; we remove these using the na.omit() function. After removing the missing data, we are left with 263 observations on 20 variables (19 predictors and one response).
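The leave-one-out option mentioned above is not used in the analysis below, so here is a small illustrative call showing the arguments in action. This is a sketch only: it runs PCR on R's built-in mtcars data, not on Hitters.

```r
library(pls)

# Illustrative only: PCR of mpg on all other mtcars variables,
# with standardized predictors and leave-one-out cross-validation.
fit.loocv <- pcr(mpg ~ ., data = mtcars,
                 scale = TRUE,          # standardize predictors
                 validation = "LOOCV")  # leave-one-out instead of 10-fold CV
summary(fit.loocv)
```

Mean centering (center = TRUE) is the default, so it does not need to be specified explicitly.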
library(ISLR2)
Hitters <- na.omit(Hitters)
head(Hitters)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## -Al Newman 185 37 1 23 8 21 2 214 42 1
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## -Al Newman 30 9 24 N E 76 127 7
## Salary NewLeague
## -Alan Ashby 475.0 N
## -Alvin Davis 480.0 A
## -Andre Dawson 500.0 N
## -Andres Galarraga 91.5 N
## -Alfredo Griffin 750.0 A
## -Al Newman 70.0 A
First, we apply the pcr() function using the full data set as the training data. The cross-validation (CV) score is provided for each possible number of principal components, ranging from \(M=0\) to \(M=19\).
library(pls)
set.seed(2)
pcr.fit1 <- pcr(Salary ~., data = Hitters,
scale = TRUE,
validation = "CV")
summary(pcr.fit1)
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 452 351.9 353.2 355.0 352.8 348.4 343.6
## adjCV 452 351.6 352.7 354.4 352.1 347.6 342.7
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 345.5 347.7 349.6 351.4 352.1 353.5 358.2
## adjCV 344.7 346.7 348.5 350.1 350.7 352.0 356.5
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 349.7 349.4 339.9 341.6 339.2 339.6
## adjCV 348.0 347.7 338.2 339.7 337.2 337.6
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26 94.96
## Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69 46.75
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 96.28 97.26 97.98 98.65 99.15 99.47 99.75
## Salary 46.86 47.76 47.82 47.85 48.10 50.40 50.55
## 16 comps 17 comps 18 comps 19 comps
## X 99.89 99.97 99.99 100.00
## Salary 53.01 53.85 54.61 54.61
The summary() function reports the percentage of variance explained in the predictors and in the response for different numbers of components. For example, using only one component, \(M=1\), captures only \(38.31\%\) of the variance in the predictors and \(40.63\%\) of the variance in the response. In contrast, using six components, \(M=6\), increases these values to \(88.63\%\) and \(46.48\%\), respectively. Of course, using all 19 components, \(M = p = 19\), explains \(100\%\) of the variance in the predictors; it also raises the percentage of variance explained in the response to \(54.61\%\).
Note that the pcr() function reports the root mean squared error of prediction (RMSEP); to obtain the usual MSE, we must square this quantity. For example, a root mean squared error of \(351.9\) (for \(M=1\)) corresponds to an MSE of \(351.9^{2} = 123,833.6\).
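The conversion is just arithmetic:

```r
# Squaring the cross-validated RMSEP for M = 1 gives the corresponding MSE
rmse <- 351.9
mse  <- rmse^2
mse
## [1] 123833.6
```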
We can use the validationplot() function to plot the CV scores and identify the number of components that minimizes the cross-validation error (the smallest mean squared error of prediction).
validationplot(pcr.fit1, val.type = "MSEP", legendpos = "top")
From the plot, we see that a model with only 6 components may be adequate.
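Rather than reading the minimum off the plot, the cross-validated MSEP values can also be extracted directly with pls's MSEP() function. This is a sketch assuming the pcr.fit1 object fitted above:

```r
library(pls)
# MSEP() returns an array of error estimates; the "CV" row holds the
# 10-fold CV errors, indexed from 0 components (intercept only) up to 19.
cv.mse <- MSEP(pcr.fit1)$val["CV", , ]
which.min(cv.mse) - 1  # subtract 1: the first entry corresponds to M = 0
```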
We now perform PCR on the training data and evaluate its test set performance. First, we set a random seed so our results will be reproducible, since the choice of the cross-validation folds is random. Then, we split the data into \(50\%\) training data and \(50\%\) test data as follows:
set.seed(1) # to reproduce the results
# Partition (split) the data: create an index of rows for the training set
index <- sample(1:nrow(Hitters), 0.5*nrow(Hitters))
# Create training and test data
train_df <- Hitters[index,]
test_df <- Hitters[-index,]
set.seed(1)
pcr.fit2 <- pcr(Salary ~., data = Hitters,
subset = index,
scale = TRUE,
validation = "CV")
validationplot(pcr.fit2, val.type = "MSEP")
We find (from the MSEP plot above) that the lowest cross-validation error occurs when around \(M = 5\) or \(M = 6\) components are used. Thus, we fit a regression using only 5 principal components and evaluate its performance by computing the MSE on the test data as follows:
pcr.fit3 <- pcr(Salary ~., data = train_df,
scale = TRUE,
ncomp = 5)
summary(pcr.fit3)
## Data: X dimension: 131 19
## Y dimension: 131 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 39.32 61.57 71.96 80.83 85.95
## Salary 43.87 43.93 47.36 47.37 49.52
Finally, we compute the test MSE for this model:
pcr.pred3 <- predict(pcr.fit3, test_df, ncomp = 5)
mean((pcr.pred3 - test_df$Salary)^2)
## [1] 142811.8
Recall that a smaller mean squared error (MSE) indicates a better model, but on its own we cannot tell whether \(MSE = 142811.8\) is small or large. We need an additional model against which to compare MSE values.
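One simple benchmark (an illustrative addition, not part of the original analysis) is the null model, which predicts every test observation with the mean training-set Salary; its test MSE provides a scale against which the value above can be judged:

```r
# Null-model benchmark: predict the training-set mean Salary for every
# test observation (assumes train_df and test_df from the split above)
baseline.pred <- mean(train_df$Salary)
mean((baseline.pred - test_df$Salary)^2)
```

A PCR test MSE well below this benchmark would indicate that the principal components carry real predictive information about Salary.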
To assess whether adding an additional component can enhance the performance of this model, we compute the MSE for the regression model using 6 principal components (PCs):
pcr.fit4 <- pcr(Salary ~., data = train_df,
scale = TRUE,
ncomp = 6)
pcr.pred4 <- predict(pcr.fit4, test_df, ncomp = 6)
mean((pcr.pred4 - test_df$Salary)^2)
## [1] 141882.9
The MSE associated with the 6-PC regression is \(141882.9\), which is smaller than the one obtained using 5 PCs. Thus, we may conclude that a regression model with only 6 components is a good approach for predicting a baseball player’s Salary from the Hitters data.