Cross-Validating Different Regression Models Using K-Fold (California Housing Dataset)

Now it’s time to cross-validate different regression models using K-Fold and compare the performance of each model. Let’s make use of the California Housing dataset from Sklearn. The code is as follows:

Python
from sklearn.datasets import fetch_california_housing

# fetch california housing data
housing = fetch_california_housing()
print("Dataset Shape:", housing.data.shape, housing.target.shape)
print("Dataset Features:", housing.feature_names)

Output

Dataset Shape: (20640, 8) (20640,)
Dataset Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
'Population', 'AveOccup', 'Latitude', 'Longitude']

Here we make use of the fetch_california_housing() method from the sklearn datasets module. The dataset consists of 20,640 samples with 8 features, plus a target label (the median house value, expressed in units of $100,000).

Here, the dataset contains only numerical features, and there are no missing values. So we don’t need to deal with text attributes or missing values; all we need to do is scale the features.
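If you want to verify this yourself, a quick check like the following (a small addition, not part of the original listing) confirms that neither the feature matrix nor the target contains missing values:

Python
import numpy as np

# Count NaN entries in the features and the target
print("Missing feature values:", np.isnan(housing.data).sum())
print("Missing target values:", np.isnan(housing.target).sum())

Both counts come out to zero.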

Let’s scale the features and apply K-Fold to the dataset.

Python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
import numpy as np


X_housing = housing.data
y_housing = housing.target

# Scaling the data
scaler = StandardScaler()
X_scaler = scaler.fit_transform(X_housing)

# K-Fold split
cnt = 0
n_splits = 10
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X_scaler, y_housing):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1

Output

Fold:0, Train set: 18576, Test set:2064
Fold:1, Train set: 18576, Test set:2064
Fold:2, Train set: 18576, Test set:2064
Fold:3, Train set: 18576, Test set:2064
Fold:4, Train set: 18576, Test set:2064
Fold:5, Train set: 18576, Test set:2064
Fold:6, Train set: 18576, Test set:2064
Fold:7, Train set: 18576, Test set:2064
Fold:8, Train set: 18576, Test set:2064
Fold:9, Train set: 18576, Test set:2064

Here, we scaled the features by passing them to the fit_transform() method of Sklearn's StandardScaler. Then we prepared the K-Fold validation procedure, setting the number of folds to 10 and shuffling the data before splitting by setting the shuffle parameter to True.
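As a quick sanity check (an extra step, not in the original listing), you can confirm that each scaled feature now has approximately zero mean and unit variance:

Python
# Each column of the scaled matrix should have mean ~0 and std ~1
print("Per-feature means:", X_scaler.mean(axis=0).round(6))
print("Per-feature stds:", X_scaler.std(axis=0).round(6))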

Let’s visualise the split using matplotlib.

Python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
plot_kfold(kf, X_scaler, y_housing, ax, n_splits, xlim_max=2000)
# Make the legend fit
plt.tight_layout()
fig.subplots_adjust(right=0.7)

Output

K-Fold with Shuffle

We make use of the same plot_kfold() helper (explained above) to visualize the data split. Notice that in the plot the training and test samples are mixed across the dataset. This is because we set the shuffle parameter of K-Fold to True, which ensures each fold draws its test data from different sections of the dataset rather than from one contiguous block.
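To see the effect of shuffling directly, here is a small comparison sketch (an illustrative addition): it prints the first few test indices of the first fold with and without shuffling:

Python
# Without shuffle: the first fold's test indices form a contiguous block
kf_plain = KFold(n_splits=10)
_, test_idx = next(kf_plain.split(X_scaler))
print("Without shuffle:", test_idx[:5])

# With shuffle: the test indices are scattered across the dataset
kf_shuffled = KFold(n_splits=10, shuffle=True, random_state=42)
_, test_idx = next(kf_shuffled.split(X_scaler))
print("With shuffle:", test_idx[:5])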

Now let’s create different regression models and apply K-fold cross validation. The code is as follows:

Python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import cross_val_score

def cross_validation(reg_model, housing_prepared, housing_labels, cv):
    scores = cross_val_score(
      reg_model, housing_prepared,
      housing_labels,
      scoring="neg_mean_squared_error", cv=cv)
    rmse_scores = np.sqrt(-scores)
    print("Scores:", rmse_scores)
    print("Mean:", rmse_scores.mean())
    print("StandardDeviation:", rmse_scores.std())

print("----- Linear Regression Model Cross Validation ------")
lin_reg = LinearRegression()
cross_validation(lin_reg, X_scaler, y_housing, kf)
print("")
print("----- Decision Tree Regression Model Cross Validation ------")
tree_reg = DecisionTreeRegressor()
cross_validation(tree_reg, X_scaler, y_housing, kf)
print("")
print("----- Random Forest Regression Model Cross Validation ------")
forest_reg = RandomForestRegressor()
cross_validation(forest_reg, X_scaler, y_housing, kf)

Output

----- Linear Regression Model Cross Validation ------
Scores: [0.74766431 0.74372259 0.6936579 0.75776228 0.69926807 0.72690314
0.74241379 0.68908607 0.75124511 0.74163695]
Mean: 0.7293360220706322
StandardDeviation: 0.02440550831772841

----- Decision Tree Regression Model Cross Validation ------
Scores: [0.69024329 0.71299152 0.72902583 0.74687543 0.73311366 0.70912615
0.71031728 0.70438177 0.71907938 0.74508813]
Mean: 0.7200242426779767
StandardDeviation: 0.01731035436143824

----- Random Forest Regression Model Cross Validation ------
Scores: [0.50050277 0.49624521 0.47534694 0.522097 0.48679587 0.51611116
0.48861124 0.46187822 0.50740703 0.50927282]
Mean: 0.4964268280240172
StandardDeviation: 0.017721367101897926

In the above code, we created three different regression models (Linear, Decision Tree and Random Forest regression) and measured the prediction error of each using cross-validation. The cross_val_score() method uses neg_mean_squared_error as the evaluation metric (the scoring parameter) and our K-Fold object as the cross-validation procedure. The data is randomly split into 10 distinct subsets called folds, and the model is trained and evaluated 10 times, each time evaluating on a different fold and training on the other 9 folds.
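To make this procedure concrete, the following sketch reproduces what cross_val_score() does for one of the models with an explicit loop over the K-Fold splits (a hand-rolled equivalent, not the library's actual implementation):

Python
from sklearn.metrics import mean_squared_error

# Manual equivalent of cross_val_score with a K-Fold splitter
rmse_per_fold = []
for train_index, test_index in kf.split(X_scaler):
    X_train, X_test = X_scaler[train_index], X_scaler[test_index]
    y_train, y_test = y_housing[train_index], y_housing[test_index]
    model = LinearRegression()
    model.fit(X_train, y_train)      # train on the other 9 folds
    preds = model.predict(X_test)    # evaluate on the held-out fold
    rmse_per_fold.append(np.sqrt(mean_squared_error(y_test, preds)))
print("Mean RMSE:", np.mean(rmse_per_fold))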

You can notice that the Decision Tree has a mean prediction error of about 0.720, whereas Linear Regression scores about 0.729. Since the target is expressed in units of $100,000, these correspond to roughly $72,002 and $72,933 respectively. The Random Forest Regressor seems to be the most promising model, with a prediction error of about 0.496 (roughly $49,642).

Once you have identified a promising model, you can fine-tune it to further improve performance.
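For example, you could fine-tune the Random Forest with a grid search over a few hyperparameters, reusing the same K-Fold splitter (a minimal sketch; the parameter grid below is illustrative, not tuned):

Python
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid -- keep it small, since every
# combination is trained once per fold
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=kf,
)
grid_search.fit(X_scaler, y_housing)
print("Best params:", grid_search.best_params_)
print("Best RMSE:", np.sqrt(-grid_search.best_score_))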
