How to Use PyJanitor for Data Cleaning in Python
Let's explore some common data cleaning tasks and how PyJanitor can simplify them.
1. Cleaning Column Names with PyJanitor
We can clean all column names at once using the clean_names() function of PyJanitor. By default, this function converts column names to lowercase and replaces spaces with underscores; passing remove_special=True also strips special characters such as @. Here's an example of how to use this function.
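import pandas as pd
import janitor  # importing janitor registers clean_names() and other methods on DataFrame
# Column names containing capitals, spaces, and a special character
data = {'Column @1': [1, 2], 'Column @2': [3, 4]}
data = pd.DataFrame(data)
print(data)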
Output:
Column @1 Column @2
0 1 3
1 2 4
data = data.clean_names(remove_special=True)
print(data)
Output:
column_1 column_2
0 1 3
1 2 4
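To make the transformation concrete, here is a minimal plain-pandas sketch of the same kind of normalization. The helper normalize_column_names is hypothetical and written only for illustration; it is not PyJanitor's implementation, and clean_names() covers more cases than this sketch does.

```python
import re

import pandas as pd


def normalize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: lowercase names, turn whitespace into
    underscores, and drop every other non-alphanumeric ASCII character."""

    def normalize(name) -> str:
        name = str(name).strip().lower()
        name = re.sub(r"\s+", "_", name)        # spaces -> underscores
        name = re.sub(r"[^0-9a-z_]", "", name)  # drop special characters
        return re.sub(r"_+", "_", name)         # collapse repeated underscores

    # rename() accepts a function and applies it to each column label
    return df.rename(columns=normalize)


df = pd.DataFrame({"Column @1": [1, 2], "Column @2": [3, 4]})
cleaned = normalize_column_names(df)
print(list(cleaned.columns))  # ['column_1', 'column_2']
```

The original DataFrame is left untouched, since rename() returns a new object.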
2. Removing Empty Rows and Columns
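We can remove empty rows and empty columns using the remove_empty() function, which drops every row and every column whose values are all missing. In the example below, the second row is entirely empty, so it is dropped and the remaining rows are renumbered.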
import pandas as pd
import janitor
data = {'A': [1, None, 3], 'B': [4, None, 6]}
data = pd.DataFrame(data)
data = data.remove_empty()
print(data)
Output:
A B
0 1.0 4.0
1 3.0 6.0
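The example above only exercises row removal. As a rough plain-pandas equivalent (an approximation for illustration, not PyJanitor's source), the sketch below drops all-missing rows and then all-missing columns with dropna(how="all"), and renumbers the index. Column C is an assumption added here to show a column being removed.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, None, 3],
    "B": [4, None, 6],
    "C": [None, None, None],  # hypothetical column with no data at all
})

cleaned = (
    df.dropna(axis=0, how="all")   # drop rows where every value is missing
      .dropna(axis=1, how="all")   # drop columns where every value is missing
      .reset_index(drop=True)      # renumber the remaining rows from 0
)
print(cleaned)
```

Rows or columns that contain at least one value are kept, because how="all" only matches fully empty ones.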
3. Identifying Duplicate Data Points
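We can identify repeated data points using the duplicated() method. This is a built-in pandas DataFrame method rather than a PyJanitor function, so it works with or without importing janitor. It returns True for a row only when every column matches an earlier row, and False otherwise; the first occurrence of a repeated row is not flagged.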
import pandas as pd
import janitor
data = {
'A': [1, 2, 2, 4],
'B': [5, 6, 6, 8]
}
data = pd.DataFrame(data)
duplicates = data.duplicated()
print(duplicates)
Output:
0 False
1 False
2 True
3 False
dtype: bool
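Beyond the default behaviour, duplicated() takes a few options that are often useful during cleaning. The short sketch below uses only pandas and the same data as above; it shows flagging every copy with keep=False, removing repeats with drop_duplicates(), and comparing on a subset of columns.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 4], "B": [5, 6, 6, 8]})

# keep=False flags every member of a duplicate group, including the first
all_copies = df.duplicated(keep=False)
print(all_copies.tolist())  # [False, True, True, False]

# drop_duplicates() keeps the first occurrence of each repeated row
deduplicated = df.drop_duplicates()
print(deduplicated.index.tolist())  # [0, 1, 3]

# subset= restricts the comparison to the listed columns
by_a = df.duplicated(subset=["A"])
print(by_a.tolist())  # [False, False, True, False]
```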
4. Encoding Object Data Type to Categorical Data Type
We can convert object columns to the categorical data type using the encode_categorical() function, passing the names of the columns we want to encode.
import pandas as pd
import janitor
data = {
'A': ['low', 'medium', 'high', 'medium', 'low'],
'B': ['type1', 'type2', 'type1', 'type3', 'type2']
}
data = pd.DataFrame(data)
print(data.dtypes)
# Encode columns 'A' and 'B' as categorical
# (column_names is the current parameter name; columns is a deprecated alias)
data = data.encode_categorical(column_names=['A', 'B'])
print(data)
print(data.dtypes)
Output:
A object
B object
dtype: object
A B
0 low type1
1 medium type2
2 high type1
3 medium type3
4 low type2
A category
B category
dtype: object
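Column A in this example has a natural order (low, medium, high), which pandas' default category dtype does not capture because it is unordered. One way to express that order, sketched here with plain pandas and assuming the level order shown, is an ordered CategoricalDtype.

```python
import pandas as pd

df = pd.DataFrame({"A": ["low", "medium", "high", "medium", "low"]})

# Assumed level order for illustration: low < medium < high
level_dtype = pd.CategoricalDtype(
    categories=["low", "medium", "high"], ordered=True
)
df["A"] = df["A"].astype(level_dtype)

# Integer codes follow the declared category order
codes = df["A"].cat.codes.tolist()
print(codes)  # [0, 1, 2, 1, 0]

# Ordered categories support comparisons such as max() and sorting
print(df["A"].max())  # high
sorted_levels = df.sort_values("A")["A"].tolist()
print(sorted_levels)  # ['low', 'low', 'medium', 'medium', 'high']
```

With an ordered dtype, sorting and comparisons follow the declared levels rather than alphabetical order.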
5. Renaming Columns
Renaming columns is a common task when cleaning data. PyJanitor provides the clean_names() function to standardize all column names in one step by converting them to lowercase and replacing spaces with underscores.
import pandas as pd
import janitor
# Sample DataFrame with messy column names
data = {
'First Name': [1, 2, 3, 4],
'Last Name': [5, 6, 7, 8],
'Age (Years)': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
# Clean column names
cleaned_df = df.clean_names()
print(cleaned_df)
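clean_names() renames every column by rule. When specific target names are needed instead, one option is an explicit mapping with pandas' own DataFrame.rename, sketched below. The target names are chosen here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": [1, 2, 3, 4],
    "Last Name": [5, 6, 7, 8],
    "Age (Years)": [9, 10, 11, 12],
})

# Map each old column name to the exact new name we want
renamed = df.rename(columns={
    "First Name": "first_name",
    "Last Name": "last_name",
    "Age (Years)": "age_years",
})
print(list(renamed.columns))  # ['first_name', 'last_name', 'age_years']
```

rename() returns a new DataFrame, so df keeps its original column names unless the result is assigned back.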
6. Filtering Data
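Filtering data based on certain conditions is a common data cleaning task. PyJanitor provides the filter_string() function, which keeps the rows whose value in a given column contains a search string. The match is case-sensitive, so searching for 'a' in the example below matches Charlie and David but not Alice.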
import pandas as pd
import janitor
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40]
}
df = pd.DataFrame(data)
# Filter rows where Name contains 'a'
filtered_df = df.filter_string(column_name='Name', search_string='a')
print(filtered_df)
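Substring filters of this kind are commonly built on pandas' Series.str.contains, which is case-sensitive unless told otherwise. The plain-pandas sketch below (an illustration of the idea, not PyJanitor's source) runs the same search on the same data, first with the default matching and then with case=False.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
})

# Default: case-sensitive, so the capital 'A' in Alice does not match
sensitive = df[df["Name"].str.contains("a")]
print(sensitive["Name"].tolist())  # ['Charlie', 'David']

# case=False also matches names that contain an uppercase 'A'
insensitive = df[df["Name"].str.contains("a", case=False)]
print(insensitive["Name"].tolist())  # ['Alice', 'Charlie', 'David']
```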
Streamlining Data Cleaning with PyJanitor: A Comprehensive Guide
Data cleaning is a crucial step in the data analysis pipeline. It involves transforming raw data into a clean dataset that can be used for analysis. This process can be time-consuming and error-prone, especially when dealing with large datasets. PyJanitor is a Python library that aims to simplify data cleaning by providing a set of convenient functions for common data cleaning tasks. In this article, we will explore PyJanitor, its features, and how it can be used to streamline the data cleaning process.
Table of Contents
- What is PyJanitor?
- Key Features of PyJanitor
- Installing PyJanitor
- Using PyJanitor for Data Cleaning in Python
  - 1. Cleaning Column Names with PyJanitor
  - 2. Removing Empty Rows and Columns
  - 3. Identifying Duplicate Data Points
  - 4. Encoding Object Data Type to Categorical Data Type
  - 5. Renaming Columns
  - 6. Filtering Data
- Pipe() Method in PyJanitor: Custom Functions
- Exploring Different PyJanitor Functions
  - 1. fill_empty(data, column_names, value)
  - 2. filter_on(data, criteria, complement=False)
  - 3. rename_column(data, old_column_name, new_column_name)
  - 4. add_column(df, column_name, value, fill_remaining=False)