How to Use PyJanitor for Data Cleaning in Python

Let’s explore some common data cleaning tasks and how PyJanitor can simplify them.

1. Cleaning Column Names with PyJanitor

We can clean every column name at once with PyJanitor's clean_names() method, which becomes available on DataFrames once janitor is imported. It converts the column names to lowercase and replaces spaces with underscores; passing remove_special=True also strips special characters such as @. Here’s an example.

Python

import pandas as pd
import janitor
data = {'Column @1': [1, 2], 'Column @2': [3, 4]}
data = pd.DataFrame(data)

print(data)

Output:

   Column @1  Column @2
0          1          3
1          2          4

Python

data = data.clean_names(remove_special=True)

print(data)

Output:

   column_1  column_2
0         1         3
1         2         4
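To make these steps concrete, here is a rough plain-pandas sketch of the same transformation. The helper simple_clean_names is hypothetical and only approximates clean_names(remove_special=True); it is not PyJanitor's actual implementation, and its ASCII-only pattern would also drop accented letters.

```python
import re

import pandas as pd


def simple_clean_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper approximating clean_names(remove_special=True)."""

    def clean(name) -> str:
        name = str(name).lower()                # 1. lowercase
        name = re.sub(r"\s+", "_", name)        # 2. whitespace -> underscore
        name = re.sub(r"[^a-z0-9_]", "", name)  # 3. drop special characters
        return name

    # rename() accepts a callable and applies it to every column label
    return df.rename(columns=clean)


data = pd.DataFrame({'Column @1': [1, 2], 'Column @2': [3, 4]})
cleaned = simple_clean_names(data)
print(list(cleaned.columns))  # ['column_1', 'column_2']
```

In practice clean_names() is the better choice, since it handles more edge cases than this sketch.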

2. Removing Empty Rows and Columns

We can remove rows and columns that are entirely empty (every value missing) using the remove_empty() method. As the output shows, the remaining rows are re-indexed from 0.

Python

import pandas as pd
import janitor

data = {'A': [1, None, 3], 'B': [4, None, 6]}
data = pd.DataFrame(data)  

data = data.remove_empty()   

print(data)

Output:

     A    B
0  1.0  4.0
1  3.0  6.0

3. Identifying Duplicate Data Points

We can flag repeated rows with duplicated(). This is a standard pandas DataFrame method rather than a PyJanitor addition. It returns True for a row when the values in all of its columns match an earlier row, and False otherwise, so the first occurrence of a repeated row is not flagged.

Python

import pandas as pd
import janitor

data = {
    'A': [1, 2, 2, 4],
    'B': [5, 6, 6, 8]
}
data = pd.DataFrame(data)

duplicates = data.duplicated()
print(duplicates)

Output:

0    False
1    False
2     True
3    False
dtype: bool
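Flagging is usually followed by removal. The sketch below uses only pandas (no PyJanitor call is involved) to show two common variations: marking every copy of a repeated row, and dropping the repeats.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 4], 'B': [5, 6, 6, 8]})

# keep=False flags every member of a duplicate group, not just the repeats
all_copies = df.duplicated(keep=False)
print(all_copies.tolist())  # [False, True, True, False]

# drop_duplicates() keeps the first occurrence of each row by default
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(deduped))         # 3
```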

4. Encoding Object Data Type to Categorical Data Type

We can convert object (string) columns to the pandas categorical data type with the encode_categorical() method, passing the names of the columns we want to encode.

Python

import pandas as pd
import janitor

data = {
    'A': ['low', 'medium', 'high', 'medium', 'low'],
    'B': ['type1', 'type2', 'type1', 'type3', 'type2']
}
data = pd.DataFrame(data)
print(data.dtypes)

# Encoding columns 'A' and 'B' as categorical
data = data.encode_categorical(column_names=['A', 'B'])

print(data)
print(data.dtypes)

Output:


A    object
B    object
dtype: object
        A      B
0     low  type1
1  medium  type2
2    high  type1
3  medium  type3
4     low  type2
A    category
B    category
dtype: object
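Once a column has the category dtype, it behaves like any pandas categorical. The following plain-pandas sketch uses astype() rather than encode_categorical(), an assumption made to keep it independent of the PyJanitor version. It shows the categories and integer codes behind column 'A', and how an explicit ordering can be declared.

```python
import pandas as pd

df = pd.DataFrame({'A': ['low', 'medium', 'high', 'medium', 'low']})

# Unordered categorical: categories are sorted alphabetically
unordered = df['A'].astype('category')
print(unordered.cat.categories.tolist())  # ['high', 'low', 'medium']
print(unordered.cat.codes.tolist())       # [1, 2, 0, 2, 1]

# Ordered categorical with an explicit low < medium < high ranking
levels = pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True)
ordered = df['A'].astype(levels)
print(ordered.max())                      # high
print((ordered >= 'medium').tolist())     # [False, True, True, True, False]
```

An ordered categorical is worth the extra line whenever the labels have a natural ranking, because comparisons and sorting then follow that ranking instead of alphabetical order.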

5. Renaming Columns

Renaming columns is a common task when cleaning data. As in the first example, PyJanitor's clean_names() method standardizes all column names in one call by converting them to lowercase and replacing spaces with underscores.

Python

import pandas as pd
import janitor

# Sample DataFrame with messy column names
data = {
    'First Name': [1, 2, 3, 4],
    'Last Name': [5, 6, 7, 8],
    'Age (Years)': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# Clean column names
cleaned_df = df.clean_names()
print(cleaned_df)

Output:

   first_name  last_name  age_years_
0           1          5           9
1           2          6          10
2           3          7          11
3           4          8          12

6. Filtering Data

Filtering rows on a condition is another common data cleaning task. PyJanitor provides the filter_string() method, which keeps the rows whose value in a given column contains a search string.

Python

import pandas as pd
import janitor

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40]
}
df = pd.DataFrame(data)

# Keep rows where Name contains a lowercase 'a' (the match is case-sensitive)
filtered_df = df.filter_string(column_name='Name', search_string='a')
print(filtered_df)

Output:

      Name  Age
2  Charlie   35
3    David   40

Streamlining Data Cleaning with PyJanitor: A Comprehensive Guide

Data cleaning is a crucial step in the data analysis pipeline. It involves transforming raw data into a clean dataset that can be used for analysis. This process can be time-consuming and error-prone, especially when dealing with large datasets. PyJanitor is a Python library that aims to simplify data cleaning by providing a set of convenient functions for common data cleaning tasks. In this article, we will explore PyJanitor, its features, and how it can be used to streamline the data cleaning process.

Table of Contents

What is PyJanitor?
Key Features of PyJanitor
Installing PyJanitor
Using PyJanitor for Data Cleaning in Python

1. Cleaning Column Names with PyJanitor
2. Removing Empty Rows and Columns
3. Identifying Duplicate Data Points
4. Encoding Object Data Type to Categorical Data Type
5. Renaming Columns
6. Filtering Data

Pipe() Method in PyJanitor: Custom Functions
Exploring Different PyJanitor Functions

1. fill_empty(data, column_names, value)
2. filter_on(data, criteria, complement=False)
3. rename_column(data, old_column_name, new_column_name)
4. add_column(df, column_name, value, fill_remaining=False)
