Feature Engineering
Now let's see which columns we should drop and/or modify so the model can predict on the test data. The main tasks in this step are to drop unnecessary features and to convert string data into numerical categories for easier training.
We'll start off by dropping the Cabin feature, since little more useful information can be extracted from it. Before dropping it, though, we will derive a new column from it indicating whether cabin information was recorded or not.
Python3
# Create a new column CabinBool indicating
# whether the Cabin value was given or was NaN
train["CabinBool"] = train["Cabin"].notnull().astype('int')
test["CabinBool"] = test["Cabin"].notnull().astype('int')

# Delete the column 'Cabin' from the test
# and train datasets
train = train.drop(['Cabin'], axis=1)
test = test.drop(['Cabin'], axis=1)
We can also drop the Ticket feature, since it's unlikely to yield any useful information.
Python3
train = train.drop(['Ticket'], axis=1)
test = test.drop(['Ticket'], axis=1)
There are missing values in the Embarked feature. We will replace these NULL values with 'S', since 'S' is by far the most common port of embarkation in the training set.
Python3
# replace the missing values in
# the Embarked feature with 'S'
train = train.fillna({"Embarked": "S"})
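As a quick sanity check before hard-coding 'S', you can count each port and confirm which one dominates. The frame below is a tiny made-up stand-in for the real train set, just to illustrate the pattern:

```python
import pandas as pd

# Tiny illustrative frame (assumption: the real code would use `train`)
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", None, "S"]})

# value_counts() ignores NaN, so idxmax() gives the most common port
counts = df["Embarked"].value_counts()
most_common = counts.idxmax()

# impute missing ports with the modal value
df = df.fillna({"Embarked": most_common})
```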
We will now sort the ages into groups, combining nearby ages into the same category. By doing so we will have fewer categories and a better prediction, since the result is a categorical feature.
Python3
# sort the ages into logical categories
train["Age"] = train["Age"].fillna(-0.5)
test["Age"] = test["Age"].fillna(-0.5)
bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager',
          'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels=labels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels=labels)
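How pd.cut assigns these bins can be seen on a handful of illustrative ages (made-up values, not the real column). Note that the -0.5 sentinel used for missing ages lands in the 'Unknown' bin:

```python
import pandas as pd
import numpy as np

# Illustrative ages only (assumption); -0.5 marks a missing age
ages = pd.Series([-0.5, 4, 10, 16, 22, 30, 45, 70])
bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager',
          'Student', 'Young Adult', 'Adult', 'Senior']

# each age falls into the half-open interval (left, right]
groups = pd.cut(ages, bins, labels=labels)
```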
Next we extract a Title from each Name in both the train and test sets and group the rare titles into broader classes. Then we assign numerical values to the titles for convenience of model training.
Python3
# create a combined group of both datasets
combine = [train, test]

# extract a title for each Name in the
# train and test datasets
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train['Title'], train['Sex'])

# replace various titles with more common names
# ('Lady' is left out of the 'Rare' list so the
# 'Royal' replacement below can still match it)
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(
        ['Capt', 'Col', 'Don', 'Dr', 'Major',
         'Rev', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace(
        ['Countess', 'Lady', 'Sir'], 'Royal')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

# map each of the title groups to a numerical value
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3,
                 "Master": 4, "Royal": 5, "Rare": 6}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
Now using the title information we can fill in the missing age values.
Python3
mr_age = train[train["Title"] == 1]["AgeGroup"].mode()      # Young Adult
miss_age = train[train["Title"] == 2]["AgeGroup"].mode()    # Student
mrs_age = train[train["Title"] == 3]["AgeGroup"].mode()     # Adult
master_age = train[train["Title"] == 4]["AgeGroup"].mode()  # Baby
royal_age = train[train["Title"] == 5]["AgeGroup"].mode()   # Adult
rare_age = train[train["Title"] == 6]["AgeGroup"].mode()    # Adult

age_title_mapping = {1: "Young Adult", 2: "Student", 3: "Adult",
                     4: "Baby", 5: "Adult", 6: "Adult"}

# fill each 'Unknown' age group using the modal group for its title;
# .loc is used instead of chained indexing, which can fail silently
for x in range(len(train["AgeGroup"])):
    if train["AgeGroup"][x] == "Unknown":
        train.loc[x, "AgeGroup"] = age_title_mapping[train["Title"][x]]

for x in range(len(test["AgeGroup"])):
    if test["AgeGroup"][x] == "Unknown":
        test.loc[x, "AgeGroup"] = age_title_mapping[test["Title"][x]]
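The row-by-row loop above can equivalently be written with a boolean mask and `.map`, which is faster and more idiomatic in pandas. This sketch uses a tiny made-up frame standing in for the real train set:

```python
import pandas as pd

# Minimal stand-in frame (assumption): numeric Title codes and an
# AgeGroup column with an 'Unknown' gap, as produced by the steps above
df = pd.DataFrame({
    "Title": [1, 2, 1],
    "AgeGroup": pd.Categorical(
        ["Unknown", "Student", "Young Adult"],
        categories=["Unknown", "Baby", "Child", "Teenager",
                    "Student", "Young Adult", "Adult", "Senior"]),
})

age_title_mapping = {1: "Young Adult", 2: "Student", 3: "Adult",
                     4: "Baby", 5: "Adult", 6: "Adult"}

# replace only the 'Unknown' rows with the modal group for their title
mask = df["AgeGroup"] == "Unknown"
df.loc[mask, "AgeGroup"] = df.loc[mask, "Title"].map(age_title_mapping)
```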
Now assign a numerical value to each age category. Once we have mapped the ages into categories, we no longer need the Age feature, so we drop it.
Python3
# map each AgeGroup value to a numerical value
age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3, 'Student': 4,
               'Young Adult': 5, 'Adult': 6, 'Senior': 7}
train['AgeGroup'] = train['AgeGroup'].map(age_mapping)
test['AgeGroup'] = test['AgeGroup'].map(age_mapping)

train.head()

# dropping the Age feature for now, might change
train = train.drop(['Age'], axis=1)
test = test.drop(['Age'], axis=1)
Drop the Name feature, since it no longer contains useful information now that the title has been extracted.
Python3
train = train.drop(['Name'], axis=1)
test = test.drop(['Name'], axis=1)
Assign numerical values to the Sex and Embarked categories.
Python3
sex_mapping = {"male": 0, "female": 1}
train['Sex'] = train['Sex'].map(sex_mapping)
test['Sex'] = test['Sex'].map(sex_mapping)

embarked_mapping = {"S": 1, "C": 2, "Q": 3}
train['Embarked'] = train['Embarked'].map(embarked_mapping)
test['Embarked'] = test['Embarked'].map(embarked_mapping)
Fill in the missing Fare value in the test set based on the mean fare for that Pclass.
Python3
# fill the missing Fare in the test set with the mean fare of the
# matching Pclass; .loc is used instead of chained indexing
for x in range(len(test["Fare"])):
    if pd.isnull(test["Fare"][x]):
        pclass = test["Pclass"][x]  # Pclass = 3
        test.loc[x, "Fare"] = round(
            train[train["Pclass"] == pclass]["Fare"].mean(), 4)

# map Fare values into groups of
# numerical values
train['FareBand'] = pd.qcut(train['Fare'], 4, labels=[1, 2, 3, 4])
test['FareBand'] = pd.qcut(test['Fare'], 4, labels=[1, 2, 3, 4])

# drop the raw Fare values
train = train.drop(['Fare'], axis=1)
test = test.drop(['Fare'], axis=1)
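How pd.qcut splits a column into quartile bands can be seen on a small set of illustrative fares (made-up values, not the real column): each of the four labels covers roughly 25% of the passengers.

```python
import pandas as pd

# Illustrative fares (assumption); qcut computes the quartile
# boundaries from the data itself, so each band is equal-sized
fares = pd.Series([7.25, 8.05, 13.0, 26.55, 35.5, 71.28, 151.55, 512.33])
bands = pd.qcut(fares, 4, labels=[1, 2, 3, 4])
```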
Now we are done with the feature engineering.
Titanic Survival Prediction Using Machine Learning
In this article, we will learn to predict the survival chances of the Titanic passengers using the given information about their sex, age, etc. As this is a classification task, we will be using a random forest.
There will be three main steps in this experiment:
- Feature Engineering
- Imputation
- Training and Prediction
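The final Training and Prediction step can be sketched as follows. This is a minimal illustration on a tiny synthetic stand-in frame (the column values below are assumptions); the real code would fit the random forest on the full engineered train set built in the feature engineering step:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered train set (assumption:
# features like Pclass, Sex, AgeGroup exist after feature engineering)
train = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 2, 3, 1, 2],
    "Sex":      [0, 1, 1, 1, 0, 0, 0, 1],
    "AgeGroup": [5, 6, 4, 6, 5, 2, 6, 3],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

# separate features from the target
X = train.drop("Survived", axis=1)
y = train["Survived"]

# hold out a validation split to check the model
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# fit a random forest and predict on the held-out rows
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_val)
```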