Data Preprocessing
Data preprocessing, the stage in which raw data is prepared for analysis and modeling, is an essential part of any data analysis or machine learning pipeline. It improves the quality and reliability of the data, which in turn improves the performance of machine learning models. Let's see how to perform it:
Log Transformation and Distribution Plot
Python3
# Apply the natural logarithm transformation to the 'charges' column
df['charges'] = np.log1p(df['charges'])

# Create a distribution plot for the transformed 'charges' column
sb.distplot(df['charges'])

# Display the distribution plot
plt.show()
Output:
In this code, 'np.log1p' applies the natural logarithm transformation (log(1 + x)) to the 'charges' column of the DataFrame ('df'), which reduces the skewness of the data distribution. A distribution plot (histogram) of the transformed 'charges' column is then created with 'sb.distplot' and displayed, showing how the values are distributed after the logarithmic transformation. The 'age' and 'bmi' columns are roughly normally distributed, but 'charges' is right-skewed; applying the logarithmic transformation brings it much closer to a normal distribution. (Note: 'sb.distplot' is deprecated in recent seaborn versions; 'sb.histplot' with 'kde=True' is the modern equivalent.)
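The effect of the transformation can be checked numerically as well as visually. The sketch below uses synthetic right-skewed data (a stand-in for the 'charges' column, since the insurance dataset itself is not loaded here) and compares the skewness before and after 'np.log1p':

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data standing in for the 'charges' column
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=9, sigma=0.9, size=1000))

skew_before = charges.skew()          # strongly positive (right-skewed)
skew_after = np.log1p(charges).skew() # close to 0 after the transform

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

A skewness near 0 after the transform indicates the values are now approximately symmetric, which is what the distribution plot shows visually.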
One-Hot Encoding Categorical Columns
Python3
# Mapping Categorical to Numerical Values

# Map 'sex' column values ('male' to 0, 'female' to 1)
df['sex'] = df['sex'].map({'male': 0, 'female': 1})

# Map 'smoker' column values ('no' to 0, 'yes' to 1)
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})

# Display the DataFrame's first few rows to show the transformations
df.head()
Output:
age sex bmi children smoker charges northeast northwest \
0 19 1 27.900 0 1 9.734236 0 0
1 18 0 33.770 1 0 7.453882 0 0
2 28 0 33.000 3 0 8.400763 0 0
3 33 0 22.705 0 0 9.998137 0 1
4 32 0 28.880 0 0 8.260455 0 1
southeast southwest
0 0 1
1 1 0
2 1 0
3 0 0
4 0 0
This code performs categorical-to-numerical mapping for the ‘sex’ and ‘smoker’ columns, making the data suitable for machine learning algorithms that require numerical input. It also displays the initial rows of the DataFrame to illustrate the changes.
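A small self-contained sketch of how 'Series.map' behaves, using a toy Series in place of the real 'sex' column. One caveat worth knowing: 'map' returns NaN for any value missing from the dictionary, so accidentally re-running the mapping on an already-numeric column wipes it out:

```python
import pandas as pd

# Toy Series standing in for the 'sex' column
sex = pd.Series(['male', 'female', 'female', 'male'])
mapped = sex.map({'male': 0, 'female': 1})
print(mapped.tolist())  # [0, 1, 1, 0]

# Caution: Series.map returns NaN for values absent from the dict,
# so re-running the same mapping on the already-numeric column
# replaces every value with NaN
remapped = mapped.map({'male': 0, 'female': 1})
print(remapped.isna().all())  # True
```

This is why such mapping cells should be run exactly once on a fresh copy of the data.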
One-hot encoding for “Region” column
Python3
# Perform one-hot encoding on the 'region' column
temp = pd.get_dummies(df['region']).astype('int')

# Concatenate the one-hot encoded columns with the original DataFrame
df = pd.concat([df, temp], axis=1)
This code applies one-hot encoding to the 'region' column, turning the categorical region values into binary columns, one per distinct region. Concatenating the resulting one-hot encoded columns with the original DataFrame expands the dataset with a binary feature for each region.
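The same pattern on a toy DataFrame (the four region values match those in the insurance dataset; the rest of the data is made up for illustration):

```python
import pandas as pd

# Toy 'region' column with the four values used in the dataset
df = pd.DataFrame({'region': ['southwest', 'southeast',
                              'northwest', 'northeast']})

# One binary column per distinct region, columns sorted alphabetically
temp = pd.get_dummies(df['region']).astype('int')

# Attach the dummy columns alongside the original column
df = pd.concat([df, temp], axis=1)
print(df)
```

Each row has exactly one 1 across the four dummy columns, so no artificial ordering between regions is introduced.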
Python3
# Remove 'Id' and 'region' columns from the DataFrame
df.drop(['Id', 'region'], inplace=True, axis=1)

# Display the updated DataFrame
print(df.head())
Output:
age sex bmi children smoker charges northeast northwest \
0 19 1 27.900 0 1 9.734236 0 0
1 18 0 33.770 1 0 7.453882 0 0
2 28 0 33.000 3 0 8.400763 0 0
3 33 0 22.705 0 0 9.998137 0 1
4 32 0 28.880 0 0 8.260455 0 1
southeast southwest
0 0 1
1 1 0
2 1 0
3 0 0
4 0 0
The only remaining categorical column is 'region'. Let's one-hot encode it: it has more than two categories, so ordinal (label) encoding would impose an artificial ranking between regions that does not exist in reality.
Splitting Data
Python3
# Define the features
features = df.drop('charges', axis=1)

# Define the target variable as 'charges'
target = df['charges']

# Split the data into training and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(features, target,
                                                  random_state=2023,
                                                  test_size=0.25)

# Display the shapes of the training and validation sets
X_train.shape, X_val.shape
Output:
((1003, 11), (335, 11))
To evaluate the model's performance during training, let's split the dataset in a 75:25 ratio; the two parts will then be used to create LightGBM datasets and train the model.
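The split can be sketched on toy data (the feature matrix and target below are hypothetical; the shapes differ from the insurance dataset but the 75:25 mechanics are identical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature matrix (20 rows, 2 columns) and matching target
features = pd.DataFrame(np.arange(40).reshape(20, 2), columns=['a', 'b'])
target = pd.Series(np.arange(20))

# test_size=0.25 reserves a quarter of the rows for validation;
# random_state makes the shuffle reproducible
X_train, X_val, Y_train, Y_val = train_test_split(
    features, target, random_state=2023, test_size=0.25)

print(X_train.shape, X_val.shape)  # (15, 2) (5, 2)
```

With the article's 1338-row dataset, the same call produces the (1003, 11) and (335, 11) shapes shown in the output.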
Feature scaling
Python3
# Standardize Features
# Use StandardScaler to scale the training and validation data
scaler = StandardScaler()

# Fit the StandardScaler to the training data
scaler.fit(X_train)

# Transform both the training and validation data
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
This code fits the StandardScaler to the training data to calculate the mean and standard deviation and then transforms both the training and validation data using these calculated values to ensure consistent scaling between the two datasets.
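A minimal sketch of this fit-on-train, transform-both pattern with made-up numbers, showing that the validation data is scaled with the training statistics rather than its own:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and validation data (single feature)
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_val = np.array([[2.0], [5.0]])

scaler = StandardScaler()
scaler.fit(X_train)                 # learns mean=2.5, std=sqrt(1.25) from X_train only
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)   # reuses the training mean/std

print(X_train_s.mean())             # ~0.0: training data is centered
print(X_val_s.mean())               # generally not 0: validation keeps its shift
```

Fitting the scaler on the full dataset instead would leak information from the validation set into training, which is why only X_train is passed to 'fit'.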
Dataset Preparation
Python3
# Create a LightGBM dataset for training with features X_train and labels Y_train
train_data = lgb.Dataset(X_train, label=Y_train)

# Create a LightGBM dataset for validation with features X_val and labels Y_val,
# and specify the reference dataset as train_data for consistent evaluation
test_data = lgb.Dataset(X_val, label=Y_val, reference=train_data)
Now let's wrap the training and validation splits in lgb.Dataset objects. This prepares the data for training and evaluation with LightGBM by building dataset objects from the provided features and labels; passing 'reference=train_data' ensures the validation set is binned with the same feature boundaries as the training set, so evaluation is consistent.
Regression using LightGBM
In this article, we will learn about one of the state-of-the-art machine learning models: LightGBM, or light gradient boosting machine. XGBoost (eXtreme Gradient Boosting) grew out of successive improvements to gradient boosting, but with LightGBM we can achieve similar or better results with far less computation, and train the model on even bigger datasets in less time. Let's see what LightGBM is and how we can perform regression using it.
Table of Content
- What is LightGBM?
- How LightGBM Works?
- Implementation of LightGBM
- Exploratory Data Analysis
- Data Preprocessing
- Regression Model using LightGBM
- Conclusion