Data Wrangling in Python

Data wrangling is a crucial topic in data science and data analysis. In Python, it is commonly done with pandas, an open-source library developed specifically for data analysis and data science. Pandas is used for operations such as sorting, filtering, and grouping data.

Data wrangling in Python involves the following tasks:

  1. Data exploration: The data is studied, analyzed, and understood, often by visualizing it.
  2. Dealing with missing values: Most large datasets contain missing values (NaN). They need to be handled, either by replacing them with the mean or the mode (the most frequent value) of the column, or simply by dropping the rows that contain a NaN value.
  3. Reshaping data: The data is manipulated according to the requirements; new data can be added or existing data can be modified.
  4. Filtering data: Sometimes datasets contain unwanted rows or columns that need to be removed or filtered out.
  5. Other: After applying the steps above to the raw dataset, we get a dataset that meets our requirements and can be used for purposes such as data analysis, machine learning, data visualization, or model training.

Below are examples of data wrangling that implement the above functionalities on a raw dataset:

Data exploration in Python

In data exploration, we load the data into a DataFrame and then view it in tabular format.

Python3




# Import pandas package
import pandas as pd
 
# Assign data (missing marks are stored as the string 'NaN')
data = {'Name': ['Jai', 'Princi', 'Gaurav',
                 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}
 
# Convert into DataFrame
df = pd.DataFrame(data)
 
# Display data
df


Output:

Defining the DataFrame and displaying it in tabular format
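Beyond displaying the table, a few standard pandas attributes and methods give a quick overview of the data. The following is a minimal sketch that rebuilds the same sample data so it can run on its own; the variable names `shape`, `dtypes`, `placeholder_count`, and `summary` are introduced here for illustration only.

```python
import pandas as pd

# Same sample data as in the example above;
# missing marks are stored as the string 'NaN'
data = {'Name': ['Jai', 'Princi', 'Gaurav',
                 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}
df = pd.DataFrame(data)

# Number of rows and columns
shape = df.shape

# Data type of each column; Marks is not numeric
# because it mixes numbers and strings
dtypes = df.dtypes

# Count how many entries in Marks are the 'NaN' placeholder string
placeholder_count = (df['Marks'] == 'NaN').sum()

# Summary statistics for the numeric columns (only Age here)
summary = df.describe()

print(shape)
print(dtypes)
print(placeholder_count)
print(summary)
```

Checking the column types early shows that Marks is not yet numeric, which is why the missing values need to be handled before any arithmetic on that column.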

Dealing with missing values in Python

As the previous output shows, the Marks column contains NaN values. These are missing values in the DataFrame, and we handle them here by replacing them with the column mean.

Python3




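# Convert Marks to numeric so that the missing entries
# become real missing values (NaN)
df['Marks'] = pd.to_numeric(df['Marks'], errors='coerce')

# Compute the average of the available marks (NaN is skipped)
avg = df['Marks'].mean()

# Replace missing values with the average
df['Marks'] = df['Marks'].fillna(avg)

# Display data
df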


Output:

Replacing NaN values with the average
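Replacing with the mean is only one of the options mentioned earlier. The sketch below illustrates the other two, dropping rows and filling with the mode, on a small made-up DataFrame; the marks in it are hypothetical and chosen only so that one value repeats.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration: one missing mark, one repeated mark
df = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
                   'Marks': [90.0, np.nan, 74.0, 74.0]})

# Option 1: drop every row that contains a missing value
dropped = df.dropna()

# Option 2: fill missing marks with the most frequent value (the mode)
# mode() returns a Series because there can be ties; take the first entry
mode_value = df['Marks'].mode()[0]
filled = df.fillna({'Marks': mode_value})

print(dropped)
print(filled)
```

Dropping rows loses the other information in those rows, while filling keeps every row but adds estimated values, so the choice depends on how the data will be used.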

Data Replacing in Data Wrangling

In the Gender column, we can replace the categorical values with numbers, encoding each category as a different number.

Python3




# Categorize gender: encode 'M' as 0 and 'F' as 1
df['Gender'] = df['Gender'].map({'M': 0, 'F': 1}).astype(float)
 
# Display data
df


Output:

Data encoding for gender variable in data wrangling 
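One detail of `map()` worth knowing is that any value not present in the dictionary becomes NaN. The sketch below shows this on a small Series; the label 'X' is a hypothetical unexpected value added only for illustration.

```python
import pandas as pd

# 'X' is a hypothetical label that the mapping does not cover
gender = pd.Series(['M', 'F', 'M', 'X'])

# Values missing from the dictionary are mapped to NaN
encoded = gender.map({'M': 0, 'F': 1})

# Select the original labels that could not be encoded
unmapped = gender[encoded.isna()]

print(encoded)
print(unmapped.tolist())
```

Checking for NaN after encoding is a simple way to catch labels that the mapping did not anticipate.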

Filtering data in Data Wrangling

Suppose we need the name, gender, and marks of the top-scoring students. Here we remove the unwanted data by filtering the rows with a condition and then dropping the column that is not needed.

Python3




# Filter top scoring students
df = df[df['Marks'] >= 75].copy()
 
# Remove age column from filtered DataFrame
df.drop('Age', axis=1, inplace=True)
 
# Display data
df


Output:

Dropping column and filtering rows
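The same result can also be obtained in a single step with `DataFrame.loc`, which takes a row condition and a list of columns together. The sketch below rebuilds the data as it would look after the earlier steps, assuming the missing marks were filled with the mean of 75.2 and gender was encoded as 0 and 1; the variable name `top` is introduced here for illustration.

```python
import pandas as pd

# Data as it would look after mean imputation and gender encoding
# (assumption: the two missing marks were replaced by the mean, 75.2)
df = pd.DataFrame({
    'Name': ['Jai', 'Princi', 'Gaurav',
             'Anuj', 'Ravi', 'Natasha', 'Riya'],
    'Age': [17, 17, 18, 17, 18, 17, 17],
    'Gender': [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    'Marks': [90, 76, 75.2, 74, 65, 75.2, 71]})

# Keep rows with Marks >= 75 and only the columns we need
top = df.loc[df['Marks'] >= 75, ['Name', 'Gender', 'Marks']]

print(top)
```

Unlike the in-place `drop`, this leaves the original DataFrame unchanged, which can be convenient if the full data is needed again later.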

Hence, we have finally obtained an efficient dataset that can be further used for various purposes. 

Now that we have seen the basics of data wrangling using Python and pandas, below we discuss the various operations with which we can perform data wrangling:

Data Wrangling in Python

Data Wrangling is the process of gathering, collecting, and transforming Raw data into another format for better understanding, decision-making, accessing, and analysis in less time. Data Wrangling is also known as Data Munging.


Data Wrangling Using Merge Operation

...

Data Wrangling Using Grouping Method

...

Data Wrangling by Removing Duplication

...