In this example, we fit regression trees and then apply bagging and random forests to the Boston dataset from the ISLR2 library. The data set has 506 rows and the following 13 variables (a quick check of these dimensions follows the list):

  • crim: per capita crime rate by town.

  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus: proportion of non-retail business acres per town.

  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox: nitrogen oxides concentration (parts per 10 million).

  • rm: average number of rooms per dwelling.

  • age: proportion of owner-occupied units built prior to 1940.

  • dis: weighted mean of distances to five Boston employment centres.

  • rad: index of accessibility to radial highways.

  • tax: full-value property-tax rate per $10,000.

  • ptratio: pupil-teacher ratio by town.

  • lstat: lower status of the population (percent).

  • medv: (Response variable) median value of owner-occupied homes in $1000s.
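Before modeling, we can confirm the dimensions and variable types with a quick check (this assumes the ISLR2 package is installed; the same library() call appears again in Step 1):

library(ISLR2)   # makes the Boston data frame available
dim(Boston)      # 506 rows, 13 columns
str(Boston)      # variable types and a preview of the values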

Fitting Regression Trees

Step 1: Data partition

Load the required libraries and create training/test sets.

# Load the required packages
library(ISLR2) # Introduction to Statistical Learning, Second Edition
library(caret) # Classification And REgression Training
library(rpart.plot) # Plot 'rpart' Models
# Create 50% training set and 50% test set (hold-out validation)
set.seed(1)
Index <- sample(1:nrow(Boston), nrow(Boston)/2)
# Alternatively, we may use the createDataPartition() function as below:  
# Index <- createDataPartition(Boston[,"medv"], p = 0.5, list = FALSE)
train.data <- Boston[Index,]  # 253 observations
test.data <- Boston[-Index,]  # 253 observations

Step 2: Model building

Fit the tree to the training data using the train() function from the caret library. The caret package calls the rpart() function from the rpart package and trains the model through cross-validation. For more information about these functions, type ?rpart and ?train. The important arguments of the train() function are given below:

  • form: a formula that links the response (target) variable to the independent variables (features).

  • data: the data to be used for modeling.

  • trControl: specifies the control parameters for the resampling method and other settings used when training a machine learning model. It is important when performing cross-validation or bootstrapping to assess the model’s performance. Some of the sub-arguments we can set within trControl (via the trainControl() function) are listed below; an alternative configuration is sketched after this list:

    • method: specifies the resampling method to be used, such as "cv" for k-fold cross-validation, "repeatedcv" for repeated k-fold cross-validation, "boot" for bootstrap resampling, and more.
    • number: the number of folds (\(k\)) in k-fold cross-validation, or the number of resampling iterations for bootstrapping.
    • repeats: for repeated k-fold cross-validation, it specifies the number of times the cross-validation process is repeated.
    • p: for leave-group-out cross-validation (method = "LGOCV"), the proportion of the data used for training in each resample.
  • method: specifies which classification or regression algorithm to use. For decision trees we use "rpart2", which tunes the model complexity through the maximum tree depth (maxdepth); the related method "rpart" tunes the complexity parameter cp instead.

  • preProcess: an optional argument used to mean-center and standardize the predictor variables, e.g. preProcess = c("center", "scale").

  • tuneLength: specifies the number of candidate values to consider during hyperparameter tuning. With method = "rpart2" it applies to maxdepth; with method = "rpart" it applies to cp, which controls the complexity of the tree.

  • tuneGrid: defines an explicit grid of values for the tuning parameter (e.g. maxdepth or cp).
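To illustrate the resampling sub-arguments, the control object below sketches repeated 5-fold cross-validation with three repeats (the ctrl.rep name is arbitrary and the object is not used further; the model in this example uses the plain 5-fold CV object defined next):

# Sketch only: repeated 5-fold cross-validation (3 repeats)
ctrl.rep <- trainControl(method = "repeatedcv", number = 5, repeats = 3)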

ctrl <- trainControl(method = "cv", number = 5)
fit1 <- train(medv ~., 
             data = train.data,
             method ="rpart2",
             trControl = ctrl,
             preProcess = c("center", "scale"),
             #tuneGrid = expand.grid(maxdepth = 1:10), # alternative: explicit grid of maxdepth values
             tuneLength = 10 # consider 10 different values of maxdepth (1 to 10)
             ) 
fit1
## CART 
## 
## 253 samples
##  12 predictor
## 
## Pre-processing: centered (12), scaled (12) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 201, 203, 202, 203, 203 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE      Rsquared   MAE     
##    1        6.075240  0.5164832  4.741275
##    2        4.865617  0.7042105  3.727730
##    3        4.544070  0.7259526  3.457168
##    4        4.155026  0.7778254  3.106068
##    5        3.864853  0.8089666  2.904045
##    6        3.863746  0.8086280  2.876111
##    7        3.836177  0.8112678  2.830364
##    8        3.836177  0.8112678  2.830364
##    9        3.836177  0.8112678  2.830364
##   10        3.836177  0.8112678  2.830364
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 7.
plot(fit1)

After examining the output and the plot of the fitted model, it becomes evident that the root mean squared error (RMSE) shows essentially no improvement once the maximum tree depth exceeds 7. Consequently, we set the maximum depth to 7 and retrain the model using the rpart() function from the rpart library as shown below:

fit2 <- rpart(medv ~., 
              train.data, 
              method  = "anova", #default method 
              maxdepth = 7) 
fit2
## n= 253 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 253 19447.8700 21.78656  
##    2) rm< 6.9595 222  6794.2920 19.35360  
##      4) lstat>=14.405 87  1553.7870 14.46092  
##        8) crim>=11.48635 26   302.7138 10.31538 *
##        9) crim< 11.48635 61   613.8026 16.22787  
##         18) age>=93.95 31   164.1239 14.42903 *
##         19) age< 93.95 30   245.7147 18.08667 *
##      5) lstat< 14.405 135  1815.7240 22.50667  
##       10) rm< 6.543 111   763.1337 21.37748 *
##       11) rm>=6.543 24   256.4696 27.72917 *
##    3) rm>=6.9595 31  1928.9870 39.20968  
##      6) rm< 7.553 16   505.4900 33.42500 *
##      7) rm>=7.553 15   317.0040 45.38000 *

Note that the method argument defines the type of model to fit. It can be one of "anova", "poisson", "class", or "exp". In the case of regression trees, where the target variable is numeric, we use the default method, "anova".
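As a quick confirmation, the fitted rpart object stores the method that was used; rpart would have chosen "anova" automatically for a numeric response even if the argument had been omitted:

fit2$method  # returns "anova" for this regression tree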

Notice that the output of the fitted model indicates that only four of the variables have been used in constructing the tree (rm, lstat, age, and crim). In the context of a regression tree, the deviance reported at each node is simply the sum of squared errors (SSE) within that node.
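As a sanity check, the deviance reported at the root node should equal the SSE of medv around its mean in the training data:

# Root-node deviance: SSE of medv around the training mean (about 19447.87)
sum((train.data$medv - mean(train.data$medv))^2)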

The output of the fitted model shows the sequence of splits in the tree (root, branches, leaves). For example, we start with 253 observations at the root node and split the data first on the rm variable (first branch). That is, of all the features, splitting on rm produces the largest reduction in SSE.

We see that the 222 observations with rm less than 6.9595 go to the left-hand branch (denoted by 2)) and the 31 observations with rm greater than or equal to 6.9595 go to the right-hand branch (denoted by 3)). The asterisks (*) indicate the leaf (terminal) nodes together with their prediction values.
For example, out of the 31 observations where \(\text{rm} \geq 6.9595\), 16 observations fall in the terminal node (leaf) defined by \(6.9595\leq \text{rm} < 7.553\), with a predicted median value of owner-occupied homes (on average) of \(\$33,425\). The SSE of these observations is 505.49. On the other hand, 15 observations fall in the leaf \(\text{rm} \geq 7.553\), with a predicted median value of owner-occupied homes (on average) of \(\$45,380\). The SSE of these observations is 317.00.

For the case where \(\text{rm} < 6.9595\), we have another split. We split on lstat so that the observations with \(\text{lstat} \geq 14.405\) (87 observations) go to one sub-branch and the others (135 observations) go to the other side. We continue in this manner, splitting next on crim. We predict (on average) the median value of owner-occupied homes of those with \(\text{lstat}\geq 14.405\) and \(\text{crim}\geq 11.48635\) to be \(\$10,315\). On the other hand, we predict the median value of owner-occupied homes of those with \(\text{lstat}\geq 14.405\), \(\text{crim}<11.48635\), and \(\text{age}\geq 93.95\) to be \(\$14,429\), and of those with \(\text{lstat}\geq 14.405\), \(\text{crim}<11.48635\), and \(\text{age}<93.95\) to be \(\$18,087\).

Similarly, we predict (on average) the median value of owner-occupied homes of those with \(\text{lstat} < 14.405\) and \(\text{rm}<6.543\) to be \(\$21,378\), whereas the predicted median value of owner-occupied homes (on average) of those with \(\text{lstat} < 14.405\) and \(\text{rm}\geq 6.543\) is \(\$27,729\).
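To see these rules in action, we can pass a hypothetical observation down the tree with predict(). The new.house data frame below is purely illustrative: it copies one row of the training data (so that all columns exist) and then sets rm to 7.8, which should send it to the \(\text{rm} \geq 7.553\) leaf:

# Illustrative only: a hypothetical tract with rm = 7.8
new.house <- train.data[1, ]  # start from an existing row so all columns exist
new.house$rm <- 7.8
predict(fit2, newdata = new.house)  # about 45.38, i.e. $45,380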

Step 3: Model visualizing

We can easily visualize the model output by plotting the tree with the rpart.plot() function from the rpart.plot library, as shown in the following code.

rpart.plot(fit2, type = 2, digits = 3, fallen.leaves = FALSE)

Step 4: Cost complexity pruning

Note that the rpart() function automatically applies a range of cost complexity values (the tuning parameter \(\alpha\), called cp in rpart) to prune the tree and find the subtree with the lowest error. By default, it performs a 10-fold cross-validation, comparing the error associated with each \(\alpha\) on the held-out folds. This process helps identify the optimal \(\alpha\) and, consequently, the most suitable subtree to prevent overfitting the data. To gain insight into the internal process, we can use the printcp() function as demonstrated below. For instance, with no split the cross-validation error (xerror) is 1.01828, whereas with 6 splits this error decreases to 0.23986.

# Cost complexity pruning
printcp(fit2)
## 
## Regression tree:
## rpart(formula = medv ~ ., data = train.data, method = "anova", 
##     maxdepth = 7)
## 
## Variables actually used in tree construction:
## [1] age   crim  lstat rm   
## 
## Root node error: 19448/253 = 76.869
## 
## n= 253 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.551453      0   1.00000 1.01828 0.128244
## 2 0.176101      1   0.44855 0.51236 0.059912
## 3 0.056895      2   0.27245 0.33807 0.052980
## 4 0.040936      3   0.21555 0.29998 0.041969
## 5 0.032768      4   0.17461 0.26613 0.040627
## 6 0.010488      5   0.14185 0.23932 0.039977
## 7 0.010000      6   0.13136 0.23986 0.040027
# Extract the complexity parameter (CP) table from the fitted model
cp <- fit2$cptable

We may also use the plotcp() function to create a graph displaying the cross-validation errors (y-axis) against the cost complexity (CP) values (x-axis). In this example, we observe little further reduction in the cross-validation error after reaching 7 terminal nodes (tree size \(= |T|\)). It is also worth noting that the dashed line (one standard error above the minimum) first intersects the curve at \(|T| = 5\). Therefore, we can explore whether a pruned tree with 5 terminal nodes improves prediction accuracy.

plotcp(fit2)

prune.fit <- prune(fit2, cp = cp[5, "CP"])  # CP value in row 5 (4 splits, 5 terminal nodes)
rpart.plot(prune.fit, type = 2, digits = 3, fallen.leaves = FALSE)
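Alternatively, instead of reading the value off the plot, the CP could be chosen programmatically. The sketch below (the prune.fit.min name is ours and is not used elsewhere) picks the row of the CP table with the smallest cross-validation error, which here is row 6 (xerror = 0.23932):

# Sketch: prune at the CP value with the smallest cross-validation error
best.cp <- cp[which.min(cp[, "xerror"]), "CP"]
prune.fit.min <- prune(fit2, cp = best.cp)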

Step 5: Model evaluation

Recall that a smaller Mean Squared Error (MSE) indicates a better model.
Therefore, we can assess the model’s performance by computing the MSE using the test data:

# predictions on the test set using the unpruned tree
yhat <- predict(fit2, newdata = test.data)
# Mean squared error
mean((yhat - test.data$medv)^2)
## [1] 35.28688

In this example, the test MSE is 35.29. Note that, on its own, we cannot tell whether this value is small or large; we need additional models against which to compare MSE values.

To assess whether pruning the tree can enhance the performance of this model, we compute the Mean Squared Error (MSE) for the pruned tree model using the following approach:

# predictions on the test set using the pruned tree
yhat <- predict(prune.fit, newdata = test.data)
# Mean squared error
mean((yhat - test.data$medv)^2)
## [1] 35.90102

The test MSE associated with the pruned tree is 35.90, which is slightly larger than the one obtained from the unpruned tree (35.29). Thus, we may conclude that the pruned tree model does not improve prediction accuracy in this case.
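Since the same test-MSE calculation is repeated for every model below, a small helper function could reduce the repetition. This helper is not part of the original code; it is just a convenience sketch wrapping predict() and the MSE formula used above:

# Hypothetical helper: test-set MSE for any model with a predict() method
test.mse <- function(model, data = test.data, response = "medv") {
  yhat <- predict(model, newdata = data)
  mean((yhat - data[[response]])^2)
}
test.mse(fit2)       # about 35.29 (unpruned tree)
test.mse(prune.fit)  # about 35.90 (pruned tree)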

Bagging and Random Forests

Here we apply bagging and random forests, using the randomForest() function in the randomForest package.

Perform bagging

The argument mtry = 12 in the randomForest() function indicates that all 12 predictors should be considered for each split of the tree (i.e., bagging is used). The argument importance = TRUE indicates that the importance of predictors should be assessed.

library(randomForest)
set.seed(1)
bag.boston <- randomForest(medv ~ ., 
                           data = Boston,
                           subset = Index, 
                           mtry = 12, #all 12 predictors should be considered for bagging
                           importance = TRUE)
bag.boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston, mtry = 12, importance = TRUE,      subset = Index) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 12
## 
##           Mean of squared residuals: 11.40162
##                     % Var explained: 85.17

To measure the performance of the bagged regression tree on the test set, we plot the predicted values versus the actual values. The plot helps us assess how well the model predicts the actual values. If the points on the plot are close to the 45-degree line (the line where predicted values equal true values), it suggests that the model is a good fit. If the points deviate significantly from this line, it indicates prediction errors.

yhat.bag <- predict(bag.boston, newdata = test.data)
plot(yhat.bag, test.data$medv)
abline(0, 1)

mean((yhat.bag - test.data$medv)^2)
## [1] 23.41916

The test set MSE associated with the bagged regression tree is 23.42, which is less than that obtained from the unpruned tree (35.29). This indicates that bagging yielded an improvement over the single unpruned tree. We could change the number of trees grown by randomForest() using the ntree argument:

bag.boston <- randomForest(medv ~ ., 
                           data = Boston,
                           subset = Index, 
                           mtry = 12, #all 12 predictors should be considered for bagging
                           ntree = 25)
yhat.bag <- predict(bag.boston, newdata = test.data)
mean((yhat.bag - test.data$medv)^2)
## [1] 25.75055

Perform random forest

Growing a random forest proceeds in exactly the same way as fitting the bagged model, except that we use a smaller value of the mtry argument.

By default, randomForest() uses \(p/3\) variables when building a random forest of regression trees, and \(\sqrt{p}\) variables when building a random forest of classification trees. Here we use mtry = 6.

set.seed(1)
rf.boston <- randomForest(medv ~ ., 
                           data = Boston,
                           subset = Index, 
                           mtry = 6, #subset of 6 predictors - random forest algorithm
                           importance = TRUE)
yhat.rf <- predict(rf.boston, newdata = test.data)
mean((yhat.rf - test.data$medv)^2)
## [1] 20.06644

The test set MSE is 20.07; this indicates that random forests yielded an improvement over bagging in this case.
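If we wanted to check how sensitive this result is to the choice of mtry, a simple loop over candidate values could be used. The sketch below is not part of the original analysis; the grid of mtry values and the mse.by.mtry name are arbitrary illustrations:

# Sketch: test MSE for a few candidate values of mtry
mtry.grid <- c(4, 6, 8, 12)
mse.by.mtry <- sapply(mtry.grid, function(m) {
  set.seed(1)
  fit <- randomForest(medv ~ ., data = Boston, subset = Index, mtry = m)
  mean((predict(fit, newdata = test.data) - test.data$medv)^2)
})
names(mse.by.mtry) <- mtry.grid
mse.by.mtry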

Using the importance() function, we can view the importance of each variable.

importance(rf.boston)
##           %IncMSE IncNodePurity
## crim    19.435587    1070.42307
## zn       3.091630      82.19257
## indus    6.140529     590.09536
## chas     1.370310      36.70356
## nox     13.263466     859.97091
## rm      35.094741    8270.33906
## age     15.144821     634.31220
## dis      9.163776     684.87953
## rad      4.793720      83.18719
## tax      4.410714     292.20949
## ptratio  8.612780     902.20190
## lstat   28.725343    5813.04833

Two measures of variable importance are reported:

  • The first is based upon the mean decrease in prediction accuracy on the out-of-bag samples when a given variable is permuted.
  • The second is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees.

In the case of regression trees, the node impurity is measured by the training RSS. Plots of these importance measures can be produced using the varImpPlot() function as follows:

varImpPlot(rf.boston, main = "Variable Importance as Measured by a Random Forest")

The results indicate that across all of the trees considered in the random forest, the wealth of the community (lstat) and the house size (rm) are by far the two most important variables.