In this example, we apply regression trees, bagging, and random forests to the Boston dataset from the ISLR2 library. The data has 506 rows and 13 variables:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
lstat: lower status of the population (percent).
medv: (Response variable) median value of owner-occupied homes in $1000s.
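Before modeling, it can help to take a quick look at the data. The following is a minimal sketch (it only assumes that the ISLR2 package is installed):
library(ISLR2)        # provides the Boston data
dim(Boston)           # 506 rows, 13 columns
str(Boston)           # variable types and example values
summary(Boston$medv)  # distribution of the response variable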
Load the required libraries and create the training and test sets. Here, we split the observations into a training set (50%) and a test set (50%). In this example, we use the createDataPartition() function in the caret library. For more information about this function, type ?createDataPartition.
# Load the required packages
library(ISLR2) # Introduction to Statistical Learning, Second Edition
library(caret) # Classification And REgression Training
library(rpart.plot) # Plot 'rpart' Models
# Create 50% training set and 50% test set (hold-out-validation)
set.seed(1)
Index <- createDataPartition(Boston[,"medv"], p = 0.5, list = FALSE)
# Alternatively, we may use the sample() function as below:
# Index <- sample(1:nrow(Boston), size = nrow(Boston)/2)
train.data <- Boston[Index,] # 254 observations
test.data <- Boston[-Index,] # 252 observations
Fit the tree to the training data using the train() function in the caret library. The caret package calls the rpart() function from the rpart package (Recursive Partitioning and Regression Trees) and trains the model through cross-validation. For more information about these functions, type ?rpart and ?train. The important arguments of the train() function are given below:
form: a formula that links the response (target) variable to the independent variables (features).
data: the data to be used for modeling.
trControl: specifies the control parameters for the resampling method and other settings used when training a machine learning model. It is important when performing cross-validation or bootstrapping to assess the model's performance. The resampling method is set through trainControl(), for example "cv" for k-fold cross-validation, "repeatedcv" for repeated k-fold cross-validation, "boot" for bootstrap resampling, and more.
method: defines the algorithm, specifying which classification or regression model to use. For decision trees we use "rpart2", where model complexity is controlled by the maximum tree depth (maxdepth); with method "rpart", complexity is instead controlled by the cost-complexity parameter cp.
preProcess: an optional argument used to mean-center and standardize the predictor variables by typing preProcess = c("center", "scale").
tuneLength: specifies the number of values of the tuning parameter to consider during hyperparameter tuning (cp for "rpart", which controls the complexity of the tree; maxdepth for "rpart2").
tuneGrid: defines an explicit grid of values for the tuning parameter (see the sketch after this list).
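For illustration, the sketch below (not one of the models fitted in this section; the names ctrl.grid and fit.cp are introduced here) shows how trControl and tuneGrid can be combined with method = "rpart" to tune cp over an explicit grid:
ctrl.grid <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
fit.cp <- train(medv ~ .,
                data = train.data,
                method = "rpart",       # tunes the cost-complexity parameter cp
                trControl = ctrl.grid,
                tuneGrid = expand.grid(cp = seq(0.001, 0.1, by = 0.005)))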
ctrl <- trainControl(method = "cv", number = 5) # use 5-fold cross-validation
fit1 <- train(medv ~.,
data = train.data,
method ="rpart2", # rpart2 is an enhanced version of rpart to handle overfitting more effectively
trControl = ctrl,
preProcess = c("center", "scale"),
#tuneGrid = expand.grid(cp = seq(0.01,0.5, by = 0.01)),
tuneLength = 10 # consider 10 different values of the tuning parameter maxdepth (1 to 10)
)
fit1
## CART
##
## 254 samples
## 12 predictor
##
## Pre-processing: centered (12), scaled (12)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 204, 203, 202, 203, 204
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 6.476920 0.4353511 4.866632
## 2 5.757457 0.5318007 4.487140
## 3 5.159324 0.6340182 3.807647
## 4 4.693873 0.6966748 3.372292
## 5 4.651292 0.6959925 3.309177
## 6 4.483590 0.7148845 3.200488
## 7 4.524130 0.7096856 3.209500
## 8 4.524130 0.7096856 3.209500
## 9 4.524130 0.7096856 3.209500
## 10 4.524130 0.7096856 3.209500
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 6.
plot(fit1)
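To see which tuning value was selected and to inspect the resampling results, we can use the components of the fitted train object (a brief sketch):
fit1$bestTune   # maxdepth value selected by cross-validation (6 here)
fit1$results    # resampled RMSE, R-squared, and MAE for each candidate maxdepth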
After examining the output and the plot of the fitted model, it becomes evident that the root mean squared error (RMSE) exhibits minimal variation once the maximum tree depth exceeds 6. Consequently, we set the maximum depth to 6 and retrain the model using the rpart() function from the rpart library as shown below:
fit2 <- rpart(medv ~.,
train.data,
method = "anova", #default method for regression trees
maxdepth = 6)
fit2
## n= 254
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 254 18777.4900 22.37677
## 2) lstat>=8.96 163 3617.3250 18.01656
## 4) lstat>=14.445 91 1396.6870 15.27582
## 8) nox>=0.603 56 592.4936 13.63929
## 16) lstat>=19.645 24 204.5983 11.25833 *
## 17) lstat< 19.645 32 149.8000 15.42500 *
## 9) nox< 0.603 35 414.2389 17.89429 *
## 5) lstat< 14.445 72 673.1328 21.48056 *
## 3) lstat< 8.96 91 6510.6040 30.18681
## 6) rm< 7.4235 79 2937.9040 27.82405
## 12) rm< 6.7285 48 1746.0050 25.22292
## 24) rad< 7.5 41 353.5922 23.76585 *
## 25) rad>=7.5 7 795.5371 33.75714 *
## 13) rm>=6.7285 31 364.2774 31.85161 *
## 7) rm>=7.4235 12 228.2292 45.74167 *
Note that the method argument is used to define the algorithm used to fit the model. It can be one of "anova", "poisson", "class", or "exp". In the case of regression trees, where the target variable is numeric, the default method "anova" is used.
Notice that the output of the fitted model indicates that only four of the variables have been used in constructing the tree (lstat, nox, rad, and rm). In the context of a regression tree, the deviance is simply the sum of squared errors (SSE) for the tree.
The output of the fitted model shows the steps of the tree's splits (root, branch, leaf). For example, we start with 254 observations at the root node and split the data first on the lstat variable (first branch). That is, out of all the features, lstat is the most predictive variable, the one that optimizes the reduction in SSE.
We see that the 163 observations for which lstat is greater than or equal to 8.96 go to the left-hand side of the first branch (denoted by 2)), and the 91 observations for which lstat is less than 8.96 go to the right-hand side (denoted by 3)).
The asterisks (*) indicate the leaf nodes and their associated prediction values. For example, out of the 91 observations where \(\text{lstat} < 8.96\), 12 observations fall into the terminal node (leaf) defined by \(\text{rm} \ge 7.4235\), with a predicted median value of owner-occupied homes of 45.74 (in $1000s, i.e., roughly \$45,742). The SSE of these observations (deviance) is \(228.2292\).
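We can verify these numbers directly from the training data (a small check; the split thresholds are taken from the printed tree above, and leaf is a name introduced here):
sum((train.data$medv - mean(train.data$medv))^2)  # root deviance: about 18777.49
leaf <- subset(train.data, lstat < 8.96 & rm >= 7.4235)
nrow(leaf)                                        # 12 observations in this leaf
mean(leaf$medv)                                   # predicted value: about 45.74
sum((leaf$medv - mean(leaf$medv))^2)              # leaf deviance: about 228.23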
We can easily visualize the fitted model by plotting the tree with the rpart.plot() function from the rpart.plot library, as shown in the following code.
rpart.plot(fit2, type = 2, digits = 3, fallen.leaves = FALSE)
Note that the rpart() function automatically applies a range of cost-complexity tuning values (\(\alpha\)) in order to find the pruned subtree with the lowest error. By default it performs a 10-fold cross-validation, comparing the error associated with each \(\alpha\) on the held-out validation folds. This process helps identify the optimal \(\alpha\) and, consequently, the most suitable subtree to prevent overfitting the data. To gain insight into this internal process, we can use the printcp() function as demonstrated below. For instance, with no split the cross-validation error (xerror) is 1.00716, whereas with 6 splits this error decreases to 0.29698.
# Cost complexity pruning
printcp(fit2)
##
## Regression tree:
## rpart(formula = medv ~ ., data = train.data, method = "anova",
## maxdepth = 6)
##
## Variables actually used in tree construction:
## [1] lstat nox rad rm
##
## Root node error: 18777/254 = 73.927
##
## n= 254
##
## CP nsplit rel error xerror xstd
## 1 0.460635 0 1.00000 1.00716 0.123744
## 2 0.178111 1 0.53937 0.65804 0.092806
## 3 0.082413 2 0.36125 0.45864 0.078455
## 4 0.044075 3 0.27884 0.35033 0.063727
## 5 0.031787 4 0.23477 0.30186 0.062372
## 6 0.020767 5 0.20298 0.31705 0.062848
## 7 0.012680 6 0.18221 0.29698 0.062962
## 8 0.010000 7 0.16953 0.29739 0.062960
# Extract the complexity parameter (CP) table from the fitted model
cp <- fit2$cptable
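Rather than reading the table by eye, we can pick a CP value programmatically (a sketch; best.row, best.cp, and cp.1se are names introduced here, and the last lines apply the common one-standard-error rule):
best.row <- which.min(cp[, "xerror"])   # row with the smallest cross-validation error
best.cp  <- cp[best.row, "CP"]          # CP value that minimizes xerror
best.cp
# One-standard-error rule: the simplest tree whose xerror is within one SE of the minimum
threshold <- cp[best.row, "xerror"] + cp[best.row, "xstd"]
cp.1se <- cp[which(cp[, "xerror"] <= threshold)[1], "CP"]
cp.1se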
We may also use the plotcp() function to create a graph displaying the cross-validation errors (y-axis) against the cost-complexity (CP) values (x-axis). This plot is useful for choosing the CP value that results in the lowest cross-validation error. In this example, we observe diminishing returns in the cross-validation error after reaching 5 terminal nodes (tree size \(= |T|\)). It is worth noting that the dashed line intersects the point where \(|T| = 4\). Therefore, we can explore the potential for improved prediction accuracy by employing a pruned tree with 5 terminal nodes.
plotcp(fit2)
prune.fit <- prune(fit2, cp = cp[5, "CP"]) # CP value at row 5: 4 splits, i.e. 5 terminal nodes
rpart.plot(prune.fit, type = 2, digits = 3, fallen.leaves = FALSE)
Recall that a smaller mean squared error (MSE) indicates a better model. Therefore, we can assess the model's performance by computing the MSE on the test data:
# predictions on the test set using the unpruned tree
yhat <- predict(fit2, newdata = test.data)
# Mean square errors
mean((yhat - test.data$medv)^2)
## [1] 22.51632
In this example, the mean squared error is 22.52. Note that we do not know whether this value is small or large on its own; we need additional models against which to compare MSE values.
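One informal way to put this number on the scale of the response (a rough check, not a formal benchmark) is to take its square root, giving an RMSE in the same units as medv (thousands of dollars), and compare it with the spread of the response:
sqrt(mean((yhat - test.data$medv)^2))  # test RMSE, roughly 4.7, i.e. about $4,700 on average
sd(test.data$medv)                     # standard deviation of medv in the test set, for comparison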
To assess whether pruning the tree can enhance the performance of this model, we compute the MSE for the pruned tree model as follows:
# predictions on the test set using the pruned tree
yhat.prune <- predict(prune.fit, newdata = test.data)
# Mean square errors
mean((yhat.prune - test.data$medv)^2)
## [1] 28.13026
The test MSE associated with the pruned tree is 28.13, which is larger than the one obtained from the unpruned tree (with maximum depth 6). Thus, we may conclude that the pruned tree model has lower prediction accuracy in this case.
Here we apply bagging and random forests using the randomForest() function in the randomForest package. The argument mtry = 12 in the randomForest() function indicates that all 12 predictors should be considered for each split of the tree (i.e., bagging is used). The argument importance = TRUE indicates that the importance of predictors should be assessed.
library(randomForest)
set.seed (1)
bag.boston <- randomForest(medv ~ .,
data = Boston,
subset = Index,
mtry = 12, #all 12 predictors should be considered for bagging
importance = TRUE)
bag.boston
##
## Call:
## randomForest(formula = medv ~ ., data = Boston, mtry = 12, importance = TRUE, subset = Index)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 12
##
## Mean of squared residuals: 13.99565
## % Var explained: 81.07
To measure the performance of the bagged regression tree on the test set, we plot the predicted values versus the actual values. The plot helps us assess how well the model predicts the actual values. If the points on the plot are close to the 45-degree line (the line where predicted values equal true values), it suggests that the model is a good fit. If the points deviate significantly from this line, it indicates prediction errors.
yhat.bag <- predict(bag.boston, newdata = test.data)
plot(yhat.bag, test.data$medv)
abline(0, 1)
mean((yhat.bag - test.data$medv)^2)
## [1] 13.03514
The test set MSE associated with the bagged regression tree is 13.04, which is less than the one obtained from the unpruned tree (22.52). This indicates that the bagging algorithm for regression trees yielded an improvement over the unpruned tree. We could change the number of trees grown by randomForest() using the ntree argument:
bag.boston <- randomForest(medv ~ .,
data = Boston,
subset = Index,
mtry = 12, #all 12 predictors should be considered for bagging
ntree = 25) # number of trees to grow (default = 500)
yhat.bag <- predict(bag.boston, newdata = test.data)
mean((yhat.bag - test.data$medv)^2)
## [1] 12.9277
Growing a random forest proceeds in exactly the same way as fitting the bagged model, except that we use a smaller value of the mtry argument. By default, randomForest() uses \(p/3\) variables when building a random forest of regression trees, and \(\sqrt{p}\) variables when building a random forest of classification trees. Here we use mtry = 6.
set.seed (1)
rf.boston <- randomForest(medv ~ .,
data = Boston,
subset = Index,
mtry = 6, #subset of 6 predictors - random forest algorithm
importance = TRUE)
yhat.rf <- predict(rf.boston, newdata = test.data)
mean((yhat.rf - test.data$medv)^2)
## [1] 13.13685
The test set MSE is 13.14, which is slightly larger than the MSE obtained using bagging (13.04); in this case, the random forest did not yield an improvement over bagging.
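If we want to check whether another value of mtry does better, a simple (if informal) sketch is to loop over candidate values and record the resulting test MSE; mtry.grid and test.mse are names introduced here. In practice, the out-of-bag error or cross-validation would be a cleaner way to tune mtry than repeated use of the test set.
set.seed(1)
mtry.grid <- c(3, 4, 6, 8, 12)   # candidate numbers of predictors tried at each split
test.mse <- sapply(mtry.grid, function(m) {
  rf <- randomForest(medv ~ ., data = Boston, subset = Index, mtry = m)
  mean((predict(rf, newdata = test.data) - test.data$medv)^2)
})
data.frame(mtry = mtry.grid, test.MSE = round(test.mse, 2))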
Using the importance() function, we can view the importance of each variable.
importance(rf.boston)
## %IncMSE IncNodePurity
## crim 13.648066 996.14565
## zn 2.733797 71.32496
## indus 8.535924 576.59262
## chas 1.439980 106.44409
## nox 12.190534 759.58333
## rm 32.665033 5653.80634
## age 11.045472 516.58558
## dis 13.335935 1177.37751
## rad 5.418443 105.91911
## tax 8.732893 421.49572
## ptratio 9.264858 670.83473
## lstat 33.108408 7205.24389
Two measures of variable importance are reported. The first (%IncMSE) is based on the mean increase in prediction error on the out-of-bag samples when the values of a given variable are randomly permuted. The second (IncNodePurity) is the total decrease in node impurity resulting from splits over that variable, averaged over all trees. In the case of regression trees, the node impurity is measured by the training RSS. Plots of these importance measures can be produced using the varImpPlot() function as follows:
varImpPlot(rf.boston, main = "Variable Importance as Measured by a Random Forest")
The results indicate that across all of the trees considered in the random forest, the wealth of the community (lstat) and the house size (rm) are by far the two most important variables.
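As a follow-up (a sketch, not part of the original analysis), we can look at how the predictions depend on these two variables using the partialPlot() function from the randomForest package:
par(mfrow = c(1, 2))
partialPlot(rf.boston, pred.data = train.data, x.var = "lstat",
            main = "Partial dependence on lstat")
partialPlot(rf.boston, pred.data = train.data, x.var = "rm",
            main = "Partial dependence on rm")
par(mfrow = c(1, 1))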