Stepwise Regression in R

Stepwise regression is a systematic method for adding or removing predictor variables from a multiple regression model. It is an iterative process that begins with an initial model and then explores potential improvements by adding or removing variables according to a selection criterion, such as statistical significance or an information criterion like AIC.

Stepwise regression is used in statistical modeling for several reasons:

Variable selection: It helps identify the most relevant predictor variables that have a significant impact on the response variable while excluding irrelevant or redundant variables. This can improve model interpretability and reduce overfitting.
Model simplification: By removing insignificant variables, stepwise regression can simplify the model, which can improve its generalization performance on new data.
Exploratory analysis: Stepwise regression can be used as an exploratory tool to gain insights into the relationships between variables and to generate hypotheses for further investigation.
Computational efficiency: In situations where there are many potential predictor variables, stepwise regression can be computationally more efficient than evaluating all possible combinations of variables.
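
To make the efficiency point concrete, the short sketch below counts the candidate models for the longley data used later in this article. It assumes that every column except Employed is a candidate predictor; the variable names p, all_subsets, and forward_max are illustrative.

```r
# Number of candidate predictors: all columns of longley except Employed
data(longley, package = "datasets")
p <- ncol(longley) - 1

# Each predictor is either in or out, so an exhaustive search covers 2^p models
all_subsets <- 2^p

# Forward selection fits at most p + (p - 1) + ... + 1 candidate models
forward_max <- p * (p + 1) / 2

c(predictors = p, all_subsets = all_subsets, forward_max = forward_max)
```

With six predictors the difference is small (64 models versus at most 21), but the first count doubles with every additional predictor while the second grows only quadratically.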

Overview of Stepwise Regression Methods

There are three main types of stepwise regression methods:

Forward Selection: This method starts with an empty model and sequentially adds variables based on their statistical significance.
Backward Elimination: This method starts with a full model containing all predictor variables and sequentially removes variables that are insignificant.
Stepwise Selection: This method is a combination of forward selection and backward elimination, where variables can be added or removed at each step.

Forward Selection

In forward selection, we start with a null model (a model with no predictor variables) and iteratively add variables, at each step choosing the one that improves the model the most. R's step() function measures improvement by AIC rather than by p-values. Here’s an example in R:

# Load the required dataset
data(longley, package = "datasets")

# Fit the initial null model
null_model <- lm(Employed ~ 1, data = longley)

# Perform forward selection
forward_model <- step(null_model, scope = list(lower = ~ 1, upper = ~ . - 1), 
                      direction = "forward")

# Print the summary of the selected model
summary(forward_model)

Output:

Start:  AIC=41.17
Employed ~ 1

Call:
lm(formula = Employed ~ 1, data = longley)

Residuals:
   Min     1Q Median     3Q    Max 
-5.146 -2.604  0.187  2.974  5.234 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   65.317      0.878   74.39   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.512 on 15 degrees of freedom

In the code above, we use the longley dataset, a built-in R dataset containing several economic variables. We first fit a null model with lm(), containing only the intercept, and then call step() with direction = "forward".

The scope argument specifies the range of models to be considered: lower = ~ 1 is the null model, and upper is meant to be the largest model allowed. However, in a scope formula the dot stands for the terms already in the current model, not for every column of the data, so upper = ~ . - 1 applied to the null model names no candidate predictors at all. To let step() consider all six predictors, upper has to list them explicitly, for example upper = ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population + Year.
When candidates are available, step() adds at each step the variable that lowers the selection criterion the most and stops when no addition lowers it further. The criterion is AIC by default (k = 2); setting k = log(n), where n is the number of observations, gives BIC instead.

This explains the output above: the trace stops at the starting model Employed ~ 1 because there was nothing to add, and summary() reports the intercept-only model. Its intercept, 65.317, is simply the mean of Employed across all 16 observations.
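
For a run in which predictors can actually enter the model, the upper scope needs to name them. The following is a minimal sketch under that assumption, using the same longley data; candidate_scope is an illustrative name that does not appear in the original example, and the resulting trace is not reproduced here.

```r
# Load the required dataset
data(longley, package = "datasets")

# Fit the initial null model (intercept only)
null_model <- lm(Employed ~ 1, data = longley)

# Largest model the search may reach: all six predictors, named explicitly
# (candidate_scope is an illustrative name, not part of the original example)
candidate_scope <- ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population + Year

# Forward selection by AIC; step() prints one table per step
forward_model <- step(null_model,
                      scope = list(lower = ~ 1, upper = candidate_scope),
                      direction = "forward")

# Terms in the selected model
attr(terms(forward_model), "term.labels")

# Summary of the selected model
summary(forward_model)
```

Passing k = log(nrow(longley)) to step() would run the same search with the BIC penalty instead of the default AIC penalty.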

Backward Elimination

In backward elimination, we start with a full model containing all predictor variables and iteratively remove the variable whose removal improves the model the most. With step(), that is the removal that lowers AIC the most. Here’s an example in R:

# Load the required dataset
data(longley, package = "datasets")

# Fit the initial full model
initial_model <- lm(Employed ~ ., data = longley)

# Perform backward elimination
backward_model <- step(initial_model, direction = "backward")

# Print the summary of the selected model
summary(backward_model)

Output:

Start:  AIC=-33.22
Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population + 
    Year

               Df Sum of Sq     RSS     AIC
- GNP.deflator  1   0.00292 0.83935 -35.163
- Population    1   0.00475 0.84117 -35.129
- GNP           1   0.10631 0.94273 -33.305
<none>                      0.83642 -33.219
- Year          1   1.49881 2.33524 -18.792
- Unemployed    1   1.59014 2.42656 -18.178
- Armed.Forces  1   2.16091 2.99733 -14.798

Step:  AIC=-35.16
Employed ~ GNP + Unemployed + Armed.Forces + Population + Year

               Df Sum of Sq    RSS     AIC
- Population    1   0.01933 0.8587 -36.799
<none>                      0.8393 -35.163
- GNP           1   0.14637 0.9857 -34.592
- Year          1   1.52725 2.3666 -20.578
- Unemployed    1   2.18989 3.0292 -16.628
- Armed.Forces  1   2.39752 3.2369 -15.568

Step:  AIC=-36.8
Employed ~ GNP + Unemployed + Armed.Forces + Year

               Df Sum of Sq    RSS     AIC
<none>                      0.8587 -36.799
- GNP           1    0.4647 1.3234 -31.879
- Year          1    1.8980 2.7567 -20.137
- Armed.Forces  1    2.3806 3.2393 -17.556
- Unemployed    1    4.0491 4.9077 -10.908


Call:
lm(formula = Employed ~ GNP + Unemployed + Armed.Forces + Year, 
    data = longley)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42165 -0.12457 -0.02416  0.08369  0.45268 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.599e+03  7.406e+02  -4.859 0.000503 ***
GNP          -4.019e-02  1.647e-02  -2.440 0.032833 *  
Unemployed   -2.088e-02  2.900e-03  -7.202 1.75e-05 ***
Armed.Forces -1.015e-02  1.837e-03  -5.522 0.000180 ***
Year          1.887e+00  3.828e-01   4.931 0.000449 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2794 on 11 degrees of freedom
Multiple R-squared:  0.9954,    Adjusted R-squared:  0.9937 
F-statistic: 589.8 on 4 and 11 DF,  p-value: 9.5e-13

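The trace shows how the search proceeds. Each table lists the AIC that would result from dropping one variable, and <none> marks the current model. step() first drops GNP.deflator (AIC falls from -33.22 to -35.16), then Population (AIC -36.80), and stops once <none> is at the top of the table, meaning that no further removal lowers AIC.

The final model fits the data very closely, explaining almost all of the variability in Employed (R-squared 0.9954). Each of the four remaining predictors (GNP, Unemployed, Armed.Forces, and Year) has a p-value below 0.05; Year has a positive coefficient, while GNP, Unemployed, and Armed.Forces have negative coefficients. Because these p-values are computed after the variables were selected, and from only 16 observations, they describe this dataset well but should be treated with caution as evidence about the determinants of employment in general.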
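
The AIC values in the trace can be reproduced by hand, which helps when reading the tables. The sketch below assumes the formula step() uses for linear models, n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients; final_model and aic_by_hand are illustrative names.

```r
data(longley, package = "datasets")

# The model that backward elimination selected in the output above
final_model <- lm(Employed ~ GNP + Unemployed + Armed.Forces + Year,
                  data = longley)

n   <- nrow(longley)                  # number of observations
rss <- sum(residuals(final_model)^2)  # residual sum of squares (RSS column)
edf <- n - df.residual(final_model)   # number of estimated coefficients

# AIC as reported in the step() trace
aic_by_hand <- n * log(rss / n) + 2 * edf
aic_by_hand

# extractAIC() returns c(edf, AIC) and is what step() calls internally
extractAIC(final_model)
```

The hand-computed value should match the -36.8 shown in the last step of the trace. Note that AIC(final_model) returns a different number because it keeps constant terms of the log-likelihood; only differences between models matter for the selection.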

Stepwise Selection

Stepwise selection is a combination of forward selection and backward elimination. At each step, step() considers both dropping a variable from the current model and adding back one that was dropped earlier, and it takes whichever move lowers AIC the most. Here’s an example in R:

# Load the required dataset
data(longley, package = "datasets")

# Fit the initial full model
initial_model <- lm(Employed ~ ., data = longley)

# Perform stepwise selection
stepwise_model <- step(initial_model, direction = "both")

# Print the summary of the selected model
summary(stepwise_model)

Output:

Start:  AIC=-33.22
Employed ~ GNP.deflator + GNP + Unemployed + Armed.Forces + Population + 
    Year

               Df Sum of Sq     RSS     AIC
- GNP.deflator  1   0.00292 0.83935 -35.163
- Population    1   0.00475 0.84117 -35.129
- GNP           1   0.10631 0.94273 -33.305
<none>                      0.83642 -33.219
- Year          1   1.49881 2.33524 -18.792
- Unemployed    1   1.59014 2.42656 -18.178
- Armed.Forces  1   2.16091 2.99733 -14.798

Step:  AIC=-35.16
Employed ~ GNP + Unemployed + Armed.Forces + Population + Year

               Df Sum of Sq    RSS     AIC
- Population    1   0.01933 0.8587 -36.799
<none>                      0.8393 -35.163
- GNP           1   0.14637 0.9857 -34.592
+ GNP.deflator  1   0.00292 0.8364 -33.219
- Year          1   1.52725 2.3666 -20.578
- Unemployed    1   2.18989 3.0292 -16.628
- Armed.Forces  1   2.39752 3.2369 -15.568

Step:  AIC=-36.8
Employed ~ GNP + Unemployed + Armed.Forces + Year

               Df Sum of Sq    RSS     AIC
<none>                      0.8587 -36.799
+ Population    1    0.0193 0.8393 -35.163
+ GNP.deflator  1    0.0175 0.8412 -35.129
- GNP           1    0.4647 1.3234 -31.879
- Year          1    1.8980 2.7567 -20.137
- Armed.Forces  1    2.3806 3.2393 -17.556
- Unemployed    1    4.0491 4.9077 -10.908


Call:
lm(formula = Employed ~ GNP + Unemployed + Armed.Forces + Year, 
    data = longley)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42165 -0.12457 -0.02416  0.08369  0.45268 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.599e+03  7.406e+02  -4.859 0.000503 ***
GNP          -4.019e-02  1.647e-02  -2.440 0.032833 *  
Unemployed   -2.088e-02  2.900e-03  -7.202 1.75e-05 ***
Armed.Forces -1.015e-02  1.837e-03  -5.522 0.000180 ***
Year          1.887e+00  3.828e-01   4.931 0.000449 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2794 on 11 degrees of freedom
Multiple R-squared:  0.9954,    Adjusted R-squared:  0.9937 
F-statistic: 589.8 on 4 and 11 DF,  p-value: 9.5e-13

With direction = "both", each table also contains rows marked +, which show the AIC that would result from adding a previously dropped variable back. In this run no addition ever beats the best removal, so the search follows the same path as backward elimination and ends at the same model, with GNP, Unemployed, Armed.Forces, and Year as predictors. Year has a positive coefficient, while GNP, Unemployed, and Armed.Forces have negative coefficients, and all four have p-values below 0.05. The high R-squared (0.9954) and adjusted R-squared (0.9937) indicate that the model explains nearly all of the variability in Employed within the longley dataset, although with only 16 observations a close in-sample fit does not by itself guarantee reliable predictions on new data. Future analyses could explore interactions between predictors or include additional relevant variables to further refine the model.
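
Whether two searches end at the same model can also be checked in code rather than by comparing printed summaries. The sketch below is one way to do so, assuming the same full model as above; it passes trace = 0 to suppress the step tables, and the names backward_terms and stepwise_terms are illustrative.

```r
data(longley, package = "datasets")

# Full model with all six predictors
full_model <- lm(Employed ~ ., data = longley)

# Run both searches quietly (trace = 0 suppresses the per-step tables)
backward_model <- step(full_model, direction = "backward", trace = 0)
stepwise_model <- step(full_model, direction = "both", trace = 0)

# Compare the predictors each search kept
backward_terms <- attr(terms(backward_model), "term.labels")
stepwise_terms <- attr(terms(stepwise_model), "term.labels")

backward_terms
setequal(backward_terms, stepwise_terms)
```

Given the outputs shown earlier, the comparison should return TRUE for this dataset. On other data the two directions can diverge, so the check is worth repeating whenever both are run.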