Outlier Removal in Dataset using IQR
In this example, we use the interquartile range (IQR) method to detect and remove outliers in the 'bmi' column of the diabetes dataset. The code computes the lower and upper limits as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, uses np.where to find the positions of the rows that lie at or beyond those limits, and then drops those rows from the DataFrame in place. The shape of the DataFrame is printed before and after the removal for comparison.
Python3
# Importing
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the dataset
diabetes = load_diabetes()

# Create the DataFrame
column_name = diabetes.feature_names
df_diabetes = pd.DataFrame(diabetes.data)
df_diabetes.columns = column_name
df_diabetes.head()

print("Old Shape:", df_diabetes.shape)

''' Detection '''
# IQR
# Calculate the upper and lower limits
Q1 = df_diabetes['bmi'].quantile(0.25)
Q3 = df_diabetes['bmi'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Find the row positions of the outliers. np.where returns positions,
# which match the index labels here because the DataFrame still has
# its default RangeIndex.
upper_array = np.where(df_diabetes['bmi'] >= upper)[0]
lower_array = np.where(df_diabetes['bmi'] <= lower)[0]

# Removing the outliers
df_diabetes.drop(index=upper_array, inplace=True)
df_diabetes.drop(index=lower_array, inplace=True)

# Print the new shape of the DataFrame
print("New Shape:", df_diabetes.shape)
Output:
Old Shape: (442, 10)
New Shape: (439, 10)
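The same filtering can also be written with a Boolean mask instead of np.where and drop. The sketch below is one possible variant, not the article's own code: the helper name remove_iqr_outliers, its parameter k, and the small sample DataFrame are made up for illustration. It returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd


def remove_iqr_outliers(df, column, k=1.5):
    """Hypothetical helper: return a copy of df without IQR outliers in `column`."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # True for rows strictly inside the limits; values exactly on a limit
    # are treated as outliers, as in the example above.
    mask = (df[column] > lower) & (df[column] < upper)
    return df[mask].copy()


# Made-up sample values with one obvious outlier (60.0)
sample = pd.DataFrame({"bmi": [21.0, 22.5, 23.0, 24.0, 25.5, 26.0, 60.0]})
cleaned = remove_iqr_outliers(sample, "bmi")

print("Old Shape:", sample.shape)
print("New Shape:", cleaned.shape)
```

Because the mask is aligned on the index rather than on row positions, this variant does not rely on the DataFrame having a default RangeIndex. Rows with a missing value in the column are also excluded, since comparisons with NaN evaluate to False.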
Detect and Remove the Outliers using Python
Outliers are observations that deviate significantly from the rest of the data. They can distort measures of central tendency, such as the mean, and affect the results of statistical analyses. This article covers the significance of outliers in data analysis, their common causes, ranging from errors to intentional introduction, and their relevance to outlier mining.
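To make the effect on central tendency concrete, here is a minimal sketch that uses made-up numbers (not taken from the diabetes dataset) and only the Python standard library. It compares the mean and the median of a small sample before and after a single extreme value is added.

```python
import statistics

# Made-up sample values, then the same values plus one extreme observation
values = [21.0, 22.5, 23.0, 24.0, 25.5, 26.0]
with_outlier = values + [60.0]

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

# The mean is pulled strongly toward the outlier; the median barely moves
print(f"Mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"Median: {median_before:.2f} -> {median_after:.2f}")
```

In this sample, one extreme value shifts the mean by more than five units while the median moves by half a unit, which is why outliers are usually inspected before mean-based statistics are reported.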