Handling Categorical Data using One-Hot Encoding
One-hot encoding is a technique for representing categorical variables as binary vectors. Each category is mapped to a vector in which every element is zero except the one at the category’s index, which is set to one.
Imagine you have a list of fruits: apples, bananas, and oranges. One-hot encoding is like making a checklist with one column per fruit. For an apple, you put a check in the “apple” column and leave the others blank; for a banana, you check the “banana” column, and so on. A checked column is stored as 1 and a blank one as 0. Because each fruit gets its own column instead of a number, the computer has no reason to treat one fruit as “bigger” or “smaller” than another.
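The checklist idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how scikit-learn implements it; the helper name `one_hot_encode` and the fruit list are made up for this example, and the alphabetical column order is an assumption chosen for convenience.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 row per input value."""
    # The sorted unique categories define the column order of the checklist.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)   # start with every column blank
        row[index[value]] = 1         # tick the single column for this value
        rows.append(row)
    return categories, rows


# Hypothetical input data for illustration
fruits = ["apple", "banana", "orange", "banana"]
categories, encoded = one_hot_encode(fruits)

print(categories)
for fruit, row in zip(fruits, encoded):
    print(fruit, row)
```

Every row contains exactly one 1, so no category is ranked above another.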
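In the scikit-learn snippet below, the categorical data is placed in a single-column DataFrame because OneHotEncoder expects 2D input (an array, DataFrame, or sparse matrix).
The encoder then transforms the original categorical data into one-hot encoded values. Because sparse output is turned off, the result is a dense NumPy array rather than the default sparse matrix. With drop='first', the first category (blue) gets no column of its own and is represented by a row of all zeros.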
Python3
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

colors = ['red', 'blue', 'green', 'yellow', 'blue', 'green']

# Put the data in a single-column DataFrame (OneHotEncoder requires 2D input)
colors_reshaped = pd.DataFrame(colors, columns=['Color'])

# drop='first' drops the first category to avoid multicollinearity.
# sparse_output=False returns a dense array (this argument was named
# 'sparse' before scikit-learn 1.2).
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')

# Fit and transform the data
onehot_encoded = onehot_encoder.fit_transform(colors_reshaped)

onehot_encoded_df = pd.DataFrame(
    onehot_encoded,
    columns=onehot_encoder.get_feature_names_out(['Color'])
)

print("Original Colors:")
print(colors_reshaped)
print("\nOne-Hot Encoded Colors:")
print(onehot_encoded_df)
Output:
Original Colors:
    Color
0     red
1    blue
2   green
3  yellow
4    blue
5   green

One-Hot Encoded Colors:
   Color_green  Color_red  Color_yellow
0          0.0        1.0           0.0
1          0.0        0.0           0.0
2          1.0        0.0           0.0
3          0.0        0.0           1.0
4          0.0        0.0           0.0
5          1.0        0.0           0.0
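To see what drop='first' changes, the following sketch encodes the same colors with and without it. It assumes the encoder's default sparse output and converts it with `.toarray()`, which avoids depending on the name of the dense-output argument; the variable names are illustrative only.

```python
from sklearn.preprocessing import OneHotEncoder

# Same colors as above, written directly as a 2D list (one column)
colors = [["red"], ["blue"], ["green"], ["yellow"], ["blue"], ["green"]]

# Default behaviour: one column per category
full_encoder = OneHotEncoder()
full = full_encoder.fit_transform(colors).toarray()

# drop='first': the first category in sorted order ('blue') loses its column
reduced_encoder = OneHotEncoder(drop="first")
reduced = reduced_encoder.fit_transform(colors).toarray()

print(full_encoder.categories_[0])   # learned categories, in sorted order
print(full.shape, reduced.shape)     # four columns versus three
print(full[1], reduced[1])           # the 'blue' row in each encoding
```

In the reduced encoding, a row of all zeros stands for the dropped category, so no information is lost while one redundant column is removed.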
Passing categorical data to Sklearn Decision Tree
In theory, decision trees can handle both numerical and categorical data, but in practice the data has to be converted to numbers before it is passed to the classifier. There are two common ways to handle categorical data before training: one-hot encoding and label encoding. In this article, we look at how each method converts categorical data and how the two differ.
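As a rough preview of the two approaches, the sketch below encodes one categorical feature both ways and fits a decision tree on each. The colors and target values are made-up toy data, and the variable names are illustrative. Note that scikit-learn documents LabelEncoder for target labels; it is applied to a feature here only to show what label encoding produces.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature and a binary target
colors = ["red", "blue", "green", "blue", "red", "green"]
target = [1, 0, 0, 0, 1, 0]

# Label encoding: each category becomes a single integer (sorted order)
label_encoder = LabelEncoder()
X_label = label_encoder.fit_transform(colors).reshape(-1, 1)

# One-hot encoding: each category becomes its own 0/1 column
onehot_encoder = OneHotEncoder()
X_onehot = onehot_encoder.fit_transform([[c] for c in colors]).toarray()

# Fit one tree per encoding
tree_label = DecisionTreeClassifier(random_state=0).fit(X_label, target)
tree_onehot = DecisionTreeClassifier(random_state=0).fit(X_onehot, target)

print(X_label.ravel())
print(X_onehot)
print(tree_label.predict(X_label), tree_onehot.predict(X_onehot))
```

Label encoding keeps a single column but introduces an artificial order between categories, while one-hot encoding avoids that order at the cost of one column per category.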