Linear Regression

Menu Pricing in Italian Restaurant Dataset

Menu pricing in a new Italian restaurant in New York City

Consider the dataset that is available in the file nyc.csv which can be found from the website of the book “A Modern Approach to Regression with R” written by Simon Sheather . The data are in the form of the average of customer views on

$Y:~$ Price = the price (in $US) of dinner (including one drink and a tip).
$X_1:~$ Food = customer rating of the food (out of 30).
$X_2:~$ Décor = customer rating of the decor (out of 30).
$X_3:~$ Service = customer rating of the service (out of 30).
$D:~$ East = dummy variable = 1/0 if the restaurant is east/west of Fifth Avenue.

For this data, we need to:

Develop a regression model that directly predicts the price of dinner (in dollars) using a subset or all of the 4 potential predictor variables listed above.
Determine which of the predictor variables Food, Décor and Service has the largest estimated effect on Price? Is this effect also the most statistically significant?
If the aim is to choose the location of the restaurant so that the price achieved for dinner is maximized, should the new restaurant be on the east or west of Fifth Avenue?
Does it seem possible to achieve a price premium for “setting a new standard for high-quality service in Manhattan” for Italian restaurants?

Import the data

Read and attach the data and named it as "nyc". It is more convenient to attach the data using the function $\color{blue}{\text{attach()}}$, so you can access its variables directly without calling the data.

nyc <- read.csv("https://gattonweb.uky.edu/sheather/book/docs/datasets/nyc.csv", header=TRUE)
attach(nyc)
head(nyc) # See the first 6 rows of this data

##   Case          Restaurant Price Food Decor Service East
## 1    1 Daniella Ristorante    43   22    18      20    0
## 2    2  Tello's Ristorante    32   20    19      19    0
## 3    3          Biricchino    34   21    13      18    0
## 4    4             Bottino    41   20    20      17    0
## 5    5          Da Umberto    54   24    19      21    0
## 6    6            Le Madri    52   22    22      21    0

str(nyc) # Display the structure of this data

## 'data.frame':    168 obs. of  7 variables:
##  $ Case      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Restaurant: chr  "Daniella Ristorante" "Tello's Ristorante" "Biricchino" "Bottino" ...
##  $ Price     : int  43 32 34 41 54 52 34 34 39 44 ...
##  $ Food      : int  22 20 21 20 24 22 22 20 22 21 ...
##  $ Decor     : int  18 19 13 20 19 22 16 18 19 17 ...
##  $ Service   : int  20 19 18 17 21 21 21 21 22 19 ...
##  $ East      : int  0 0 0 0 0 0 0 1 1 1 ...

Data visualization

You can visualize the matrix scatterplot, box-plot, histogram, density, and correlation plots of the data using the function $\color{blue}{\text{ggpairs()}}$ in the package GGally. Also, to get more more insight examination, you can add an interact plot by applying the $\color{blue}{\text{ggplotly()}}$ function.

Note that the variable East is an "integer" and not "character". Thus, you may (optional step and not mandatory) either use the function $\color{blue}{\text{as.factor()}}$ to change its class to "factor" or $\color{blue}{\text{as.character()}}$ to change its class to "character".

require(GGally)
require(plotly)
p <- ggpairs(nyc[,-c(1,2)], title="Correlogram") #No need to include the variables "case" and "Restaurant", so you may exclude them
ggplotly(p)

Try also the function $\color{blue}{\text{pairs.panels()}}$ in the psych package to get a nice scatterplot matrix:

library(psych)
pairs.panels(nyc[,-c(1,2)])

Based on the matrix scatterplot and correlation coefficients and their corresponding p-values, there is indeed an association (linear) between the price of dinner (Price) and the other predictors.

Simple Linear Regression

Let’s first fit a simple linear regression using the variable (Food) as a single predictor.

Step 1: Fit the model

You can fit the simple linear regression and get the output summary of this fit by using the function $\color{blue}{\text{lm()}}$ as follows.

fit_simple <- lm(Price ~ Food)
summary(fit_simple)

## 
## Call:
## lm(formula = Price ~ Food)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.8860  -3.9470   0.2056   4.2513  26.9919 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.8321     5.8631  -3.041  0.00274 ** 
## Food          2.9390     0.2834  10.371  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.261 on 166 degrees of freedom
## Multiple R-squared:  0.3932, Adjusted R-squared:  0.3895 
## F-statistic: 107.6 on 1 and 166 DF,  p-value: < 2.2e-16

Note: if you do not attach data from the previous step, then the code lm(Price ~ Food) will gives an error. In this case, the correct command should lm(Price ~ Food, data = nyc) or lm(nyc$Price ~ nyc$Food).

From the output summary, you can write the initial model as follows:

\[\widehat{\text{Price}} = -17.8321 + 2.9390 \times \text{Food}\]

Step 2: Assessing the overall accuracy of the model

The F-statistic = 107.6 (very large) and its p-value $< 2.2e-16$ (very small) indicates that the overall model is useful for predicting the price of the food. However, the Adjusted R-squared = 0.3895 is very low, which suggests that only $38.95\%$ of the variability in the price can be explain by this model. This is an indication that the model needs to be improved!

Step 3: Extract more information from the summary

You can extract lots of different information from the $\color{blue}{\text{lm()}}$ output. To see what’s inside, use the $\color{blue}{\text{names()}}$ as follows:

names(fit_simple)

##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

For example, to get the regression coefficients of this model, you can use following the command

fit_simple$coefficients

## (Intercept)        Food 
##   -17.83215     2.93896

# Alternatively, you can extract the coefficients as follows 
coef(fit_simple)

## (Intercept)        Food 
##   -17.83215     2.93896

To get the residuals of this model, you can use the command

res <- fit_simple$residuals
# Alternatively, you can extract the residuals as follows 
res <- resid(fit_simple)

Also, you can get the predicted values of the prices (response variable), by typing either one of the following syntax

pred_simple <- fit_simple$fitted.values
# or 
pred_simple <- fitted(fit_simple) 
# or
pred_simple <- fitted.values(fit_simple)

Step 4: Confidence intervals

You can extract the 95% confidence intervals around the fitted coefficients (intercept and slop) using the following command

confint(fit_simple)

##                  2.5 %    97.5 %
## (Intercept) -29.408044 -6.256253
## Food          2.379464  3.498455

For a particular value (or values) of the independent variable Food, say $\color{blue}{23}$ and $\color{blue}{26}$, you can also construct the:

$\color{blue}{\text{Confidence interval for the mean value}}$ of the response variable by typing the following code:

predict(fit_simple, data.frame(Food = (c(23, 26))), interval = "confidence")

##        fit      lwr      upr
## 1 49.76393 48.02224 51.50561
## 2 58.58081 55.36096 61.80065

$\color{blue}{\text{Prediction interval for an individual}}$ value of of the response variable by typing the following code:

predict(fit_simple, data.frame(Food = (c(23, 26))), interval = "prediction")

##        fit      lwr      upr
## 1 49.76393 35.32324 64.20462
## 2 58.58081 43.88838 73.27324

Step 5: Draw the fitted line

You can use the package ggplot2 to draw the fitted line as follows

library("tidyverse")
ggplot(data = nyc, 
       aes(x = Food, y = Price)) + 
  geom_point(color='blue') +
  geom_smooth(method = "lm")

Note, in the previous code, the default of the syntax geom_smooth() is to se = TRUE which plots a 95% confidence interval around the smooth line. If you need to plot the line without the confidence limits, then you need to specify se = FALSE as follows

ggplot(data = nyc, 
       aes(x = Food, y = Price)) + 
  geom_point(color='blue') +
  geom_smooth(method = "lm", se = FALSE)

Step 6: Check for the validity of the assumptions

The residuals plots are useful to check for:

Homogeneity variance (homoscedasticity): the variance of the errors is constant.
Linearity: the relationships between the predictors and the response variable should be linear.
Independence: the errors are not correlated.
Normality: the errors are normally distributed.

par(mfrow=c(2,2)) # Split the plotting space in a matrix of 2 rows and 2 columns 
plot(fit_simple)

par(mfrow=c(1,1))

Linearity: The plot of residuals versus predicted values, shown in the upper left of the diagnostic plots, is useful for checking the assumption of linearity and homoscedasticity. If the model meets the linear model assumption, we would expect to see a random scatter plot with residual values between -2 and 2. The red curve, which represents the LOWESS fit, should show a flat horizontal line with a zero slope. On the other hand, if this assumption is not valid, we expect to see residuals that are very large (large positive or large negative values). Based on these notes, we may conclude that the assumption of linearity is valid.

Normality: The QQ-plot in the upper right diagnostic checks shows that the normality assumption does not hold (observations are far a way from the 45-degree line) and the distribution is skewed due to many outliers.

Constant variance (homoscedasticity): To assess if the homoscedasticity assumption is met we check, again, the plot in the upper left of the diagnostic plots. The model has a constant variance if is no pattern in the residuals and that they are equally spread around the horizontal line with a zero slope. Another way is to examine the lower left plot (scale-location plot). We also need to see random pattern in the residuals. Checking the pattern of this plot, we decide that the homoscedasticity (constant variance) assumption is valid.

Leverage and outliers: Recall an outlier point is an extreme response $Y$ value; high leverage is an extreme predictor $X$ value; and an influential point is a point that has a large impact on the regression. Outliers may or may not be influential points, but outliers and high leverage data points have the potential to be influential. Influential outliers are of the greatest concern. Any observation for which the Cook’s distance is greater than or equal to 1 requires investigation. In our case, we don’t have many observations with large Cook’s distances.

In summary, we conclude that the assumptions of the linear regression model are not severely violated and our model might be good to use for prediction (but needs to be improved!)

You can apply the functions $\color{blue}{\text{hatvalues()}}$ and $\color{blue}{\text{cooks.distance()}}$ on the fitted model to get the diagonal elements of the $\color{blue}{\text{hat matrix and Cook’s distance}}$, respectively. This helps you to detect both $\color{blue}{\text{leverage}}$ and $\color{blue}{\text{influence}}$ observations. The larger the values of $\color{blue}{\text{hatvalues()}}$ and $\color{blue}{\text{cooks.distance()}}$, the larger the influence on the prediction results.

Recall:

$\color{blue}{\text{leverage}}$ is unusual predictor value.
$\color{blue}{\text{influence}}$ is a value whose removal from the data set would cause a large change in the estimated regression model coefficients.
As a rule of thumb, we usually consider points for which the hat value $h_{ii} > 2(p+1)/n \approx 0.023809$, where $p$ is the number of predictors and $n$ is the sample size.
Also, a value of Cook’s distance $D_i> 4/(n-p-1) \approx 0.024096$ is considered to be influential.

nyc <- nyc %>% 
  mutate(nyc, 
         hat=hatvalues(fit_simple), 
         cooks=cooks.distance(fit_simple))
# which values can potentially seen as influential values
filter(nyc, hat > 2*2/168)       # 2*2/168 = 2*(p+1)/n

##    Case            Restaurant Price Food Decor Service East        hat
## 1    35               FELIDIA    62   25    22      23    1 0.03550704
## 2    50            Barbaresco    44   16    16      16    1 0.03811838
## 3    54          Il Vagabondo    36   17    14      17    1 0.02564196
## 4    68   Zucchero e Pomodori    29   17    14      15    1 0.02564196
## 5    69              Baraonda    37   17    18      15    1 0.02564196
## 6    88               Erminia    54   25    24      24    1 0.03550704
## 7   116  Supreme Macaroni Co.    28   17    14      17    0 0.02564196
## 8   123 Trattoria Dopo Teatro    38   17    16      17    0 0.02564196
## 9   155         Mangia e Bevi    29   17    14      16    0 0.02564196
## 10  159               Ernie's    31   16    15      16    0 0.03811838
## 11  164                  Baci    31   17    15      16    0 0.02564196
##           cooks
## 1  0.0146348414
## 2  0.0856919814
## 3  0.0038362589
## 4  0.0025099156
## 5  0.0060750751
## 6  0.0009758685
## 7  0.0043697767
## 8  0.0088262260
## 9  0.0025099156
## 10 0.0012784330
## 11 0.0003271975

filter(nyc, cooks > 4/(168 - 2)) # 4/(168 - 2) = 4/(n-p-1)

##   Case     Restaurant Price Food Decor Service East         hat      cooks
## 1   30 Harry Cipriani    65   21    20      20    1 0.006201944 0.02655107
## 2   50     Barbaresco    44   16    16      16    1 0.038118381 0.08569198
## 3   56          Nello    54   18    16      15    1 0.016212099 0.05693604
## 4   93    Sirabella's    36   23    16      22    1 0.014761352 0.02732359
## 5  115        Lamarca    19   18     9      15    1 0.016212099 0.04102331
## 6  117       Veronica    22   21     6      14    0 0.006201944 0.02852820
## 7  130  Rainbow Grill    65   19    23      18    0 0.009828812 0.06927196
## 8  132   San Domenico    65   23    22      22    0 0.014761352 0.03348105
## 9  168        Gennaro    34   24    10      16    0 0.023610910 0.08216634

You can also examine the $\color{blue}{\text{bubble plot (influence plot)}}$ that combines the standardized residuals and the hat-values. First, you can calculate the Cook’s distances by using the function $\color{blue}{\text{cooks.distance()}}$. Then you can use the $\color{blue}{\text{hatvalues()}}$ to compute the diagnostics.

std_resid <- rstandard(fit_simple)
cooks_D <- cooks.distance(fit_simple)
hat_values <- hatvalues(fit_simple)
plot(hat_values, std_resid, cex = 10*sqrt(cooks_D), xlab = "Hat values", ylab ="Residuals")
abline(h=c(-2.5, 2.5), lty=2)

There are apparently several data points that exhibit large influence in the simple linear regression.

Alternatively, you can use the car package to:

$\color{blue}{\textbf{Check for outliers}}$: You can use the function $\color{blue}{\text{outlierTest()}}$ build in R package car to test the hypotheses \[\begin{align*} H_0:&\quad \text{Data has no outliers}\\ H_A:&\quad \text{Data has at least one outlier} \end{align*}\]

library("car")
outlierTest(fit_simple)

##     rstudent unadjusted p-value Bonferroni p
## 130 3.891861         0.00014411      0.02421

The resulted p-value from the $\color{blue}{\text{outlierTest()}}$ function suggests that we have some outliers in the menu pricing. Observation $\color{blue}{130}$ is the most extreme observation in our data which corresponds to food rate = $\color{blue}{19}$.

Food[130]

## [1] 19

$\color{blue}{\textbf{Check for influential leverage points}}$

car::influencePlot(fit_simple)

##        StudRes         Hat       CookD
## 50   2.1008683 0.038118381 0.085691981
## 117 -3.1011799 0.006201944 0.028528202
## 130  3.8918613 0.009828812 0.069271964
## 159  0.2532911 0.038118381 0.001278433
## 168 -2.6538822 0.023610910 0.082166338

The plot suggests that the values $\color{blue}{50, 117, 130, 159}$ and $\color{blue}{168}$ of the response variable $Y$(Price) might correspond to influential points of $X$(Food).

$\color{blue}{\textbf{Check for heteroscedasticity}}$: The function $\color{blue}{\text{ncvTest()}}$ from the R package car is the score test that can be used to test for non-constant error variance. The null hypothesis \[H_0:\hbox{The errors have constant variance}\] is tested against the alternative hypothesis \[H_a:\hbox{The errors have non constant variances}\]

car::ncvTest(fit_simple)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.5388802, Df = 1, p = 0.4629

The large p-value ($p = 0.4629$) suggests that the variance is constant.

Multiple Linear Regression

Now, let’s fit multiple linear regressions using all predictors.

Step 1: Fit the multiple linear regression

Fit the multiple linear regression where you include all predictors (full model with no interaction) and get the output summary.

fit_full1 <- lm(Price ~ Food + Decor + Service + East)
summary(fit_full1)

## 
## Call:
## lm(formula = Price ~ Food + Decor + Service + East)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0465  -3.8837   0.0373   3.3942  17.7491 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.023800   4.708359  -5.102 9.24e-07 ***
## Food          1.538120   0.368951   4.169 4.96e-05 ***
## Decor         1.910087   0.217005   8.802 1.87e-15 ***
## Service      -0.002727   0.396232  -0.007   0.9945    
## East          2.068050   0.946739   2.184   0.0304 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.738 on 163 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6187 
## F-statistic: 68.76 on 4 and 163 DF,  p-value: < 2.2e-16

From the output summary, we can write the initial full model without interaction as follows:

\[\widehat{\text{Price}} = -24.02 + 1.54 \times \text{Food} + 1.91 \times \text{Décor} - 0.003 \times \text{Service}+ 2.07 \times \text{East}\]

Décor has the largest effect on Price since its regression coefficient is largest. Also it is the most statistically significant since its $p$-value is the smallest of the three.
Be careful! in general we can’t compare the regression coefficients of the variable, but in this example we can! Why?
In order that the price achieved for dinner is maximized, the new restaurant should be on the east of Fifth Avenue since the coefficient of the dummy variable is statistically significantly larger than 0.
It does not seem possible to achieve a price premium for “setting a new standard for high quality service in Manhattan” for Italian restaurants since the regression coefficient of Service is not statistically significantly greater than zero.

Step 2: Fit a reduced model

Fit a reduced model removing the variable “Service” from the the previous model and get the output summary.

fit_reduced <- lm(Price ~ Food + Decor + East)
# another way to fit a reduced model is to use the function update() as follows
# m2 <- update(m1,~.-Service)
summary(fit_reduced)

## 
## Call:
## lm(formula = Price ~ Food + Decor + East)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0451  -3.8809   0.0389   3.3918  17.7557 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.0269     4.6727  -5.142 7.67e-07 ***
## Food          1.5363     0.2632   5.838 2.76e-08 ***
## Decor         1.9094     0.1900  10.049  < 2e-16 ***
## East          2.0670     0.9318   2.218   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6211 
## F-statistic: 92.24 on 3 and 164 DF,  p-value: < 2.2e-16

The updated (reduced) regression model is \[\widehat{\text{Price}} = -24.03 + 1.54 \times \text{Food} + 1.91 \times \text{Décor} + 2.07 \times \text{East}\]

Comparing the last two sets of output from R, we see that the regression coefficients for the variables in both models are very similar. $\color{red}{\text{Note that this does not always occur!}}$

Step 3: Fit an interaction model

We wonder whether the restaurants on the east side of Fifth Avenue are very different from those on the west side with service and Décor thought to be more important on the east of Fifth Avenue. Thus, to investigate whether the effect of the predictors depends on the dummy variable East, we consider the extended model (full model): \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 D + \beta_5 \times X_1 \times D + \beta_6 \times X_2 \times D + \beta_7 \times X_3 \times D +\varepsilon, \] where

$Y=$ Price,
$X_1=$ Food,
$X_2=$ Décor,
$X_3=$ Service,
$D=$ East (dummy variable of two values: 0 and 1).

We test the hypothesis \[ H_0: \beta_3=\beta_5=\beta_6=\beta_7=0 \] against \[ H_A: H_0 \text { is not true } \] i.e., we test the reduced model \[ Y=\beta_0+\beta_1 x_1+\beta_2 x_2 +\beta_4\times D +\varepsilon \]

against the full model with interaction \[ Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 D + \beta_5 \times X_1 \times D + \beta_6 \times X_2 \times D + \beta_7 \times X_3 \times D +\varepsilon \] Below is the code and output summary for fitting a full linear model taking into account the interactions due to location variable.

fit_full2 <- lm(Price~Food+Decor+Service+East+Food:East+Decor:East+Service:East)
# Alternatively, you can use * operator to shorthand the previous command as follows:
# fit_full2 <- lm(Price~ Food*East + Decor*East+ Service*East)
summary(fit_full2)

## 
## Call:
## lm(formula = Price ~ Food + Decor + Service + East + Food:East + 
##     Decor:East + Service:East)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5099  -3.7996  -0.1413   3.6522  17.1656 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -26.9949     8.4672  -3.188  0.00172 ** 
## Food           1.0068     0.5704   1.765  0.07946 .  
## Decor          1.8881     0.2984   6.327  2.4e-09 ***
## Service        0.7438     0.6443   1.155  0.25001    
## East           6.1253    10.2499   0.598  0.55095    
## Food:East      1.2077     0.7743   1.560  0.12079    
## Decor:East    -0.2500     0.4570  -0.547  0.58510    
## Service:East  -1.2719     0.8171  -1.557  0.12151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.713 on 160 degrees of freedom
## Multiple R-squared:  0.6379, Adjusted R-squared:  0.622 
## F-statistic: 40.27 on 7 and 160 DF,  p-value: < 2.2e-16

Step 4: Compare between the fitted models

We also use the function $\color{blue}{\text{anova()}}$ to compare between the reduced and full models.

anova(fit_reduced,fit_full2)

## Analysis of Variance Table
## 
## Model 1: Price ~ Food + Decor + East
## Model 2: Price ~ Food + Decor + Service + East + Food:East + Decor:East + 
##     Service:East
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    164 5366.5                           
## 2    160 5222.2  4    144.36 1.1057 0.3558

The F-statistic, from the output summary in step 3, for comparing the reduced and full models based on ANOVA is given by \[F=\frac{(RSS(\text{reduced})-RSS(\text{full}))/(df(\text{reduced})-df(\text{full}))}{RSS(\text{full})/df_{\text{full}}}\approx 1.11\] The $p-$value of ANOVA test equals $0.36$. Thus, we can’t adopt the full model and we conclude that the reduced final model

\[\widehat{\text{Price}} = -24.03 + 1.54 \times \text{Food} + 1.91 \times \text{Décor} + 2.07 \times \text{East}\] is a good to be adopted.

Step 5: Check for the validity of the model assumptions

You can check for $\color{blue}{\text{homoscedasticity, linearity, independency and normality}}$ of your residuals obtained by your fitted model as we did with the simple linear regression model.

par(mfrow=c(2,2))
plot(fit_reduced)

par(mfrow=c(1,1))

Alternatively, you can use the ggplot2 package to reproduce the previous plots as follows

# add standardized residuals as a new variable with the name "res.std" to the nyc data frame
nyc <- nyc %>% 
  mutate(res.std = rstandard(fit_reduced))
# one way to produce ggplot is to include data frame and aes() within the ggplot() function, leaving geom_point() empty as follows
p1 <- ggplot(nyc, aes(Food, res.std)) + 
  geom_point() + 
  ylab("Standardized Residuals") + 
  theme_bw()
# another way is to include data frame name within ggplot() and aes() within geom_point() as follows
p2 <- ggplot(nyc) + 
  geom_point(aes(Decor, res.std)) + 
  ylab("Standardized Residuals") + 
  theme_bw()
# one way to produce ggplot is to include data frame and aes() within the geom_point() function, leaving ggplot() empty as follows
p3 <- ggplot() + 
  geom_point(data = nyc, aes(Service, res.std))+
  ylab("Standardized Residuals") + 
  theme_bw()
# you can also start with the data frame and %>% and then add the layouts of the ggplot() as follows
p4 <- nyc %>% ggplot() + 
  geom_point(aes(East,res.std)) 
 ylab("Standardized Residuals") + 
  theme_bw()

## NULL

# USe the grid.arrange() function in the "gridExtra" package to arrange multiple ggplots on one page 
library("gridExtra") 
layout <- rbind(c(1,2),c(3,4))
grid.arrange(grobs=list(p1,p2,p3,p4),
             ncol=2,
             layout_matrix=layout)

Step 6: Test for multicollinearity (only for multiple linear models)

One of the assumptions that the multiple linear regression model satisfies is that the predictors are uncorrelated. This assumption can be checked using the $j$ th $\color{blue}{\text{variance inflation factor (VIF)}}$ given by
\[ 1 /\left(1-R_j^2\right),\quad j = 1,2,\cdots,p \] where $R_j^2$ denote the value of $\color{blue}{\text{coefficient of determination, } R^2},$ obtained from the regression of $x_{\mathrm{j}}$ on the other $x^{\prime}$ ’s (i.e., the amount of variability explained by this regression).

As a rule of thumb:

$\color{blue}{\text{VIF}< 5}$ is an indicative of $\color{blue}{\text{no}}$ multicollinearity.
$\color{blue}{5\le\text{VIF}<10}$ is an indicative of $\color{blue}{\text{minor}}$ multicollinearity.
$\color{blue}{\text{VIF}\ge 10}$ is an indicative of $\color{blue}{\text{serious}}$ multicollinearity.

In R package car, the function $\color{blue}{\text{vif()}}$ can be used to calculate the VIF for this data set.

car::vif(fit_reduced)

##     Food    Decor     East 
## 1.389515 1.346030 1.038000

We noticed that all variance inflation factors less than 5 and so no multicollinearity is detected.

Stepwise Regression

The $\color{blue}{\text{step()}}$ function in base R performs stepwise model selection (forward, backward, or stepwise) using an $\color{blue}{\text{AIC}}$ criterion.

Backward stepwise regression:

You start from the full model (all variables in the model) deleting predictor variables one at a time, and stopping when the removing of variables would no longer improve the model.

full.model <- lm(Price~ Food*East + Decor*East+ Service*East)
backward <- step(full.model, direction="backward")

## Start:  AIC=593.37
## Price ~ Food * East + Decor * East + Service * East
## 
##                Df Sum of Sq    RSS    AIC
## - East:Decor    1     9.768 5231.9 591.68
## <none>                      5222.2 593.37
## - East:Service  1    79.096 5301.3 593.89
## - Food:East     1    79.406 5301.6 593.90
## 
## Step:  AIC=591.68
## Price ~ Food + East + Decor + Service + Food:East + East:Service
## 
##                Df Sum of Sq    RSS    AIC
## <none>                      5231.9 591.68
## - Food:East     1     78.49 5310.4 592.18
## - East:Service  1    134.57 5366.5 593.95
## - Decor         1   2027.89 7259.8 644.71

summary(backward)

## 
## Call:
## lm(formula = Price ~ Food + East + Decor + Service + Food:East + 
##     East:Service)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.785  -3.942  -0.027   3.593  16.869 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -26.7229     8.4342  -3.168  0.00183 ** 
## Food           0.9532     0.5607   1.700  0.09106 .  
## East           5.7915    10.2094   0.567  0.57132    
## Decor          1.7815     0.2255   7.900 4.13e-13 ***
## Service        0.8861     0.5881   1.507  0.13387    
## Food:East      1.2006     0.7725   1.554  0.12211    
## East:Service  -1.4760     0.7253  -2.035  0.04349 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.701 on 161 degrees of freedom
## Multiple R-squared:  0.6372, Adjusted R-squared:  0.6237 
## F-statistic: 47.13 on 6 and 161 DF,  p-value: < 2.2e-16

Forward stepwise regression:

You start from the null model (only intercept with no variables in the model) adding predictor variables to the model one at a time, and stopping when the addition of variables would no longer improve the model.

The $\color{blue}{\text{step()}}$ function in base R performs the forward stepwise model using an $\color{blue}{\text{AIC}}$ criterion as seen follows:

In this example, this can be done using the function $\color{blue}{\text{step()}}$

null.model <- lm(Price ~ 1, data = nyc)
forward <- step(null.model, direction="forward",
                scope = formula(full.model),trace = 0)
summary(forward)

## 
## Call:
## lm(formula = Price ~ Decor + Food + East, data = nyc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0451  -3.8809   0.0389   3.3918  17.7557 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.0269     4.6727  -5.142 7.67e-07 ***
## Decor         1.9094     0.1900  10.049  < 2e-16 ***
## Food          1.5363     0.2632   5.838 2.76e-08 ***
## East          2.0670     0.9318   2.218   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6211 
## F-statistic: 92.24 on 3 and 164 DF,  p-value: < 2.2e-16

Stepwise regression (both):

In both stepwise regression, you combine the forward and backward stepwise approaches. Variables are entered one at a time, but at each step, the variables in the model are reevaluated, and those that don’t contribute to the model are deleted. A predictor variable may be added to, and deleted from, a model several times before a final solution is reached.

both <- step(null.model, direction="both",
             scope = formula(full.model),trace = 0)
summary(both)

## 
## Call:
## lm(formula = Price ~ Decor + Food + East, data = nyc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0451  -3.8809   0.0389   3.3918  17.7557 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.0269     4.6727  -5.142 7.67e-07 ***
## Decor         1.9094     0.1900  10.049  < 2e-16 ***
## Food          1.5363     0.2632   5.838 2.76e-08 ***
## East          2.0670     0.9318   2.218   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6211 
## F-statistic: 92.24 on 3 and 164 DF,  p-value: < 2.2e-16

Last Step: It is always recommended to detach the data using the function $\color{blue}{\text{detach()}}$.

detach(nyc)