Handling Outliers
Outliers can badly affect a model. In linear regression, for example, a single outlying data point can contribute a very large squared error and inflate the mean squared error. Handling outliers is therefore a useful step in EDA. Some models, such as decision trees and ensemble methods like random forests, are much less affected by outliers. Even so, it is good practice to handle them.
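To make the effect on linear regression concrete, here is a minimal sketch on made-up data (the arrays and the helper fit_and_mse are hypothetical and not part of the gold price dataset). It fits a straight line with NumPy and compares the mean squared error with and without a single corrupted point.

```python
import numpy as np

# Hypothetical toy data: y lies exactly on the line 2*x + 1.
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0

# Same data with one corrupted observation acting as an outlier.
y_outlier = y_clean.copy()
y_outlier[9] = 100.0


def fit_and_mse(x, y):
    """Fit a straight line by least squares and return its mean squared error."""
    slope, intercept = np.polyfit(x, y, deg=1)
    predictions = slope * x + intercept
    return float(np.mean((y - predictions) ** 2))


mse_clean = fit_and_mse(x, y_clean)
mse_outlier = fit_and_mse(x, y_outlier)
print(f"MSE without outlier:  {mse_clean:.4f}")
print(f"MSE with one outlier: {mse_outlier:.4f}")
```

Only one of the ten points changed, yet the fitted line is pulled towards it and the error on every other point grows as well.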
Plotting Boxplot to Visualize the Outliers
Boxplots are useful for showing the spread and skewness of the data, and they also show individual outlier points. The box spans the 25% to 75% quantiles (the interquartile range), and the line inside the box marks the median. The whiskers extend from the box to the lowest and highest data points that are not flagged as outliers; points beyond the whiskers are drawn individually.
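The quantities a boxplot draws can also be computed by hand. The sketch below assumes the common 1.5 × IQR rule for flagging outliers (the default in seaborn and matplotlib) and uses a small made-up sample, not the article's dataset.

```python
import numpy as np

# Hypothetical sample with one obvious outlier (50).
values = np.array([10, 12, 13, 14, 15, 16, 17, 18, 20, 50], dtype=float)

# Box edges and median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences under the 1.5 * IQR convention (an assumption of this sketch).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that lie inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("box:", q1, "to", q3, "median:", median)
print("whiskers:", whisker_low, "to", whisker_high)
print("outliers:", outliers)
```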
Python3
fig = plt.figure(figsize=(8, 8))
# All columns except Date
temp = dataset.drop("Date", axis=1).columns.tolist()
# One boxplot per column on a 2 x 3 grid
for i, item in enumerate(temp):
    plt.subplot(2, 3, i + 1)
    sns.boxplot(data=dataset, x=item, color='violet')
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=2.0)
plt.show()
Output:
The boxplots show clearly that the column ‘USO’ contains outliers, so we create a function that caps the outliers in a column.
Python3
def outlier_removal(column):
    # Cap outliers at the 5th and 95th percentiles of the column
    upper_limit = column.quantile(0.95)  # upper limit: 95th percentile
    lower_limit = column.quantile(0.05)  # lower limit: 5th percentile
    # Values beyond a limit are replaced with that limit;
    # clip returns a new Series instead of modifying the input in place
    return column.clip(lower=lower_limit, upper=upper_limit)
Here we set the upper limit of the column to its 95th percentile and the lower limit to its 5th percentile. Data points greater than the 95th percentile are replaced with the 95th-percentile value, and data points lower than the 5th percentile are replaced with the 5th-percentile value. No rows are dropped; only the extreme values are capped.
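As a quick illustration of percentile capping, the sketch below works on a made-up Series (the name prices and its values are hypothetical). It performs the capping with pandas Series.clip and then checks what happened to the extreme value.

```python
import pandas as pd

# Hypothetical toy column: the values 1..20 plus one extreme value.
prices = pd.Series(list(range(1, 21)) + [500], dtype=float)

lower_limit = prices.quantile(0.05)
upper_limit = prices.quantile(0.95)

# clip() replaces values outside [lower_limit, upper_limit] with the limits.
capped = prices.clip(lower=lower_limit, upper=upper_limit)

print("limits:", lower_limit, upper_limit)
print("max before:", prices.max(), "max after:", capped.max())
print("rows before:", len(prices), "rows after:", len(capped))
```

The number of rows stays the same, which is why this technique is better described as capping than as removal.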
Python3
# Cap outliers in all columns except Date
cols = ['SPX', 'GLD', 'USO', 'EUR/USD']
dataset[cols] = dataset[cols].apply(outlier_removal)
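Here, using the pandas apply function, we have applied the outlier_removal function to each of the selected columns. By default, apply passes one column at a time to the function, so the limits are computed separately for every column.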
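The sketch below illustrates this column-wise behaviour on a made-up DataFrame. The helper cap_to_percentiles and the columns A and B are hypothetical stand-ins, not the article's function or data.

```python
import pandas as pd


def cap_to_percentiles(column, lower=0.05, upper=0.95):
    """Cap one column at its own lower and upper percentiles."""
    return column.clip(lower=column.quantile(lower),
                       upper=column.quantile(upper))


# Hypothetical frame: A has a high outlier, B has a low one.
toy = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 100.0],
    "B": [10.0, 20.0, 30.0, 40.0, -500.0],
})

# With the default axis=0, apply calls the function once per column.
capped = toy.apply(cap_to_percentiles)

print(capped)
```

Because each column is capped at its own percentiles, the large value in A and the very negative value in B are both pulled in without affecting the other column.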
Gold Price Prediction using Machine Learning
In this article, we will build a gold price prediction project from scratch. Any data science project involves certain steps, which need not always be followed in the same order. In our project, we will go through these steps sequentially.
- Problem Formulation
- Data preprocessing
- Data wrangling
- Model Development
- Model Explainability
- Model Deployment