Handling Categorical Data using Label Encoding

Categorical data, such as colors or product categories, is converted into numerical values. Each category is assigned a unique code so that the computer can process the information. For example, instead of storing red or blue, we represent them with numbers, such as 1 for red and 2 for blue.

Label encoding assigns a distinct integer label to each unique category or class, so the data can be passed to machine learning models that expect numerical input. The integers are identifiers only: the encoding says nothing about order or distance between categories, although a model may still treat the codes as ordered numbers.

Example: Code Implementation

from sklearn.preprocessing import LabelEncoder

# Sample categorical data
colors = ['red', 'blue', 'green', 'yellow', 'blue', 'green']

# Fit the encoder and transform the data in one step;
# categories are numbered in sorted order (blue=0, green=1, red=2, yellow=3)
label_encoder = LabelEncoder()
encoded_colors = label_encoder.fit_transform(colors)

print("Original Colors:", colors)
print("Encoded Colors:", encoded_colors)


Output:

Original Colors: ['red', 'blue', 'green', 'yellow', 'blue', 'green']
Encoded Colors: [2 0 1 3 0 1]
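
The output shows that the codes follow the sorted order of the category names, not the order in which they appear. As a small sketch of how to inspect that mapping and reverse it, the snippet below reads the fitted encoder's classes_ attribute and calls inverse_transform. The variable names mapping and decoded_colors are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'yellow', 'blue', 'green']
label_encoder = LabelEncoder()
encoded_colors = label_encoder.fit_transform(colors)

# classes_ holds the sorted unique categories; each category's index is its code
mapping = {str(category): code for code, category in enumerate(label_encoder.classes_)}
print("Mapping:", mapping)

# inverse_transform converts the integer codes back to the original labels
decoded_colors = [str(c) for c in label_encoder.inverse_transform(encoded_colors)]
print("Decoded Colors:", decoded_colors)
```

Keeping the fitted encoder around is what makes the round trip possible, so the same object should be reused when decoding predictions or encoding new data.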

Passing categorical data to Sklearn Decision Tree

In theory, decision trees can handle both numerical and categorical data, but scikit-learn's implementation expects numerical input, so categorical features must be encoded before training. There are two common ways to do this: one-hot encoding and label encoding. This article explains how each method converts categorical data and how the two differ.
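
As a rough illustration of the two preparation routes, the sketch below encodes one categorical feature both ways and fits a DecisionTreeClassifier on each. The toy data and target values are made up for this example. OrdinalEncoder is used in place of LabelEncoder because it applies the same integer coding to 2-D feature matrices, whereas LabelEncoder is designed for 1-D target labels.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature and a binary target
X = np.array([['red'], ['blue'], ['green'], ['yellow'], ['blue'], ['green']])
y = np.array([1, 0, 1, 0, 0, 1])

# Integer coding: a single column holding one code per category
X_ordinal = OrdinalEncoder().fit_transform(X)

# One-hot coding: one binary column per category
X_onehot = OneHotEncoder().fit_transform(X).toarray()

# Fit one tree per encoding; random_state fixes tie-breaking between splits
tree_ordinal = DecisionTreeClassifier(random_state=0).fit(X_ordinal, y)
tree_onehot = DecisionTreeClassifier(random_state=0).fit(X_onehot, y)

print("Ordinal shape:", X_ordinal.shape)
print("One-hot shape:", X_onehot.shape)
print("Ordinal predictions:", tree_ordinal.predict(X_ordinal))
print("One-hot predictions:", tree_onehot.predict(X_onehot))
```

The integer-coded matrix keeps a single column, while the one-hot matrix grows by one column per category, which is the main practical trade-off between the two approaches.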
