How to Use PyJanitor for Data Cleaning in Python

Let’s explore some common data cleaning tasks and how PyJanitor can simplify them.

1. Cleaning Column Names with PyJanitor

We can clean all column names at once using PyJanitor's clean_names() method. By default, it converts the column names to lowercase and replaces spaces with underscores; passing remove_special=True also strips special characters such as @. Here’s an example of how to use this function.

Python
import pandas as pd
import janitor
data = {'Column @1': [1, 2], 'Column @2': [3, 4]}
data = pd.DataFrame(data)

print(data)

Output:

   Column @1  Column @2
0          1          3
1          2          4

Python
data = data.clean_names(remove_special=True)

print(data)

Output:

   column_1  column_2
0         1         3
1         2         4

2. Removing Empty Rows and Columns

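We can remove empty rows and empty columns, meaning those in which every value is missing, using the remove_empty() function. In the example below, the middle row contains only missing values, so it is dropped.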

Python
import pandas as pd
import janitor

data = {'A': [1, None, 3], 'B': [4, None, 6]}
data = pd.DataFrame(data)  

data = data.remove_empty()   

print(data)

Output:

     A    B
0  1.0  4.0
1  3.0  6.0
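
To make "empty" concrete, here is a minimal plain-pandas sketch of the same idea. It assumes that the behaviour described above amounts to dropping rows and columns whose values are all missing; it is an illustration built on dropna(), not PyJanitor's own implementation, and the extra column 'C' is added only for demonstration.

```python
import pandas as pd

# 'C' is a hypothetical all-missing column added for illustration.
df = pd.DataFrame({
    'A': [1, None, 3],
    'B': [4, None, 6],
    'C': [None, None, None],
})

cleaned = (
    df.dropna(axis='index', how='all')     # drop rows where every value is missing
      .dropna(axis='columns', how='all')   # drop columns where every value is missing
      .reset_index(drop=True)              # renumber the remaining rows from 0
)

print(cleaned)
```

Note that how='all' matters here: the default how='any' would also discard rows or columns that have only some values missing.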

3. Identifying Duplicate Data Points

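We can identify repeated rows using the duplicated() method. This is a built-in pandas method rather than a PyJanitor addition, so it works on any DataFrame. It returns a boolean Series in which a row is marked True when every column matches an earlier row, and False otherwise; the first occurrence of a repeated row is marked False.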

Python
import pandas as pd
import janitor

data = {
    'A': [1, 2, 2, 4],
    'B': [5, 6, 6, 8]
}
data = pd.DataFrame(data)

duplicates = data.duplicated()
print(duplicates)

Output:

0    False
1    False
2     True
3    False
dtype: bool
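
The default behaviour flags only the later repeats. The sketch below uses the same sample data to show two standard pandas options, keep and subset, along with drop_duplicates() for actually removing the repeats. The variable names are chosen for illustration only.

```python
import pandas as pd

# Same sample data as above: the row at index 2 repeats the row at index 1.
df = pd.DataFrame({'A': [1, 2, 2, 4], 'B': [5, 6, 6, 8]})

# keep=False flags every member of a duplicate group, not only the later repeats.
all_copies = df.duplicated(keep=False)

# subset restricts the comparison to the listed columns.
dup_on_a = df.duplicated(subset=['A'])

# drop_duplicates() keeps the first occurrence and discards the later repeats.
deduped = df.drop_duplicates().reset_index(drop=True)

print(all_copies.tolist())  # [False, True, True, False]
print(dup_on_a.tolist())    # [False, False, True, False]
print(deduped)
```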

4. Encoding Object Data Type to Categorical Data Type

We can convert columns of the object data type to the categorical data type using the encode_categorical() function, to which we pass the names of the columns we want to encode. The values themselves are unchanged; only the dtype of each column changes.

Python
import pandas as pd
import janitor

data = {
    'A': ['low', 'medium', 'high', 'medium', 'low'],
    'B': ['type1', 'type2', 'type1', 'type3', 'type2']
}
data = pd.DataFrame(data)
print(data.dtypes)

# Encoding columns 'A' and 'B' as categorical
data = data.encode_categorical(column_names=['A', 'B'])

print(data)
print(data.dtypes)

Output:


A    object
B    object
dtype: object
        A      B
0     low  type1
1  medium  type2
2    high  type1
3  medium  type3
4     low  type2
A    category
B    category
dtype: object
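
To see what the category dtype gives us, here is a small plain-pandas sketch. It uses astype() as a stand-in for encode_categorical(), which is an assumption made for illustration, and inspects the categories and integer codes behind the column. It also shows an ordered categorical, built with pandas' CategoricalDtype, for values that have a natural ranking.

```python
import pandas as pd

df = pd.DataFrame({'A': ['low', 'medium', 'high', 'medium', 'low']})

# Unordered categorical: pandas sorts the categories alphabetically by default.
unordered = df['A'].astype('category')
print(list(unordered.cat.categories))  # ['high', 'low', 'medium']
print(unordered.cat.codes.tolist())    # [1, 2, 0, 2, 1]

# Ordered categorical with an explicit ranking: low < medium < high.
level_order = pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True)
ordered = df['A'].astype(level_order)
print(ordered.cat.codes.tolist())      # [0, 1, 2, 1, 0]

# Ordered categories support comparisons against a category label.
print((ordered >= 'medium').tolist())  # [False, True, True, True, False]
```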

5. Renaming Columns

Renaming columns is a common task when cleaning data. The clean_names() method from section 1 covers the most frequent case: it standardizes column names by converting them to lowercase and replacing spaces with underscores.

Python
import pandas as pd
import janitor

# Sample DataFrame with messy column names
data = {
    'First Name': [1, 2, 3, 4],
    'Last Name': [5, 6, 7, 8],
    'Age (Years)': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# Clean column names
cleaned_df = df.clean_names()
print(cleaned_df)

Output:

6. Filtering Data

Filtering data based on certain conditions is a common data cleaning task. PyJanitor provides the filter_string function, which keeps the rows whose value in a given column contains a search string.

Python
import pandas as pd
import janitor

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40]
}
df = pd.DataFrame(data)

# Filter rows where Name contains a lowercase 'a' (the match is case-sensitive)
filtered_df = df.filter_string(column_name='Name', search_string='a')
print(filtered_df)

Output:

      Name  Age
2  Charlie   35
3    David   40
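
For comparison, the same kind of filter can be sketched in plain pandas with str.contains(). This is offered as an equivalent under the assumption that the filter is a case-sensitive substring match; it is not PyJanitor's code. It also shows how case=False and a negated mask change which rows are kept.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
})

# Case-sensitive match: 'Alice' has only an uppercase 'A', so it is excluded.
sensitive = df[df['Name'].str.contains('a')]

# case=False also matches the uppercase 'A' in 'Alice'.
insensitive = df[df['Name'].str.contains('a', case=False)]

# Negating the mask keeps the rows that do NOT match.
without_a = df[~df['Name'].str.contains('a', case=False)]

print(sensitive['Name'].tolist())    # ['Charlie', 'David']
print(insensitive['Name'].tolist())  # ['Alice', 'Charlie', 'David']
print(without_a['Name'].tolist())    # ['Bob']
```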


Streamlining Data Cleaning with PyJanitor: A Comprehensive Guide

Data cleaning is a crucial step in the data analysis pipeline. It involves transforming raw data into a clean dataset that can be used for analysis. This process can be time-consuming and error-prone, especially when dealing with large datasets. PyJanitor is a Python library that aims to simplify data cleaning by providing a set of convenient functions for common data cleaning tasks. In this article, we will explore PyJanitor, its features, and how it can be used to streamline the data cleaning process.

Table of Contents

  • What is PyJanitor?
  • Key Features of PyJanitor
  • Installing PyJanitor
  • Using PyJanitor for Data Cleaning in Python
    • 1. Cleaning Column Names with PyJanitor
    • 2. Removing Empty Rows and Columns
    • 3. Identifying Duplicate Data Points
    • 4. Encoding Object Data Type to Categorical Data Type
    • 5. Renaming Columns
    • 6. Filtering Data
  • Pipe() Method in PyJanitor: Custom Functions
  • Exploring Different PyJanitor Functions
    • 1. fill_empty(data, column_names, value)
    • 2. filter_on(data, criteria, complement=False)
    • 3. rename_column(data, old_column_name, new_column_name)
    • 4. add_column(df, column_name, value, fill_remaining=False)

What is PyJanitor?

PyJanitor is an open-source Python library built on top of Pandas, designed to extend its functionality with additional data cleaning features. It provides a set of functions that make it easier to perform common data cleaning tasks, such as removing missing values, renaming columns, and filtering data. PyJanitor aims to make data cleaning more efficient and less error-prone by providing a consistent and intuitive API....

Key Features of PyJanitor

PyJanitor offers a variety of features that simplify data cleaning:...

Installing PyJanitor

To get started with PyJanitor, you need to install it. You can install PyJanitor using pip:...

Pipe() Method in PyJanitor: Custom Functions

The pipe() method, which is available on every pandas DataFrame and therefore works alongside PyJanitor's functions, is used to slot custom functions into a chain of data-cleaning operations. Chaining lets us write a series of operations as one readable sequence, which makes the code easier to understand. Here’s an example of how to use this function....
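
As a minimal sketch of this pattern, the example below chains three custom functions with pipe(). All three helpers (standardize_columns, drop_all_empty_rows, add_total) and the sample data are hypothetical and written in plain pandas for illustration; in a PyJanitor workflow, calls such as clean_names() could sit in the same chain.

```python
import pandas as pd


def standardize_columns(df):
    """Hypothetical helper: lowercase names and replace spaces with underscores."""
    return df.rename(columns=lambda name: name.strip().lower().replace(' ', '_'))


def drop_all_empty_rows(df):
    """Hypothetical helper: remove rows in which every value is missing."""
    return df.dropna(how='all').reset_index(drop=True)


def add_total(df, left, right, name='total'):
    """Hypothetical helper: add a column holding the sum of two columns."""
    return df.assign(**{name: df[left] + df[right]})


raw = pd.DataFrame({
    'Unit Price': [2.0, None, 3.5],
    'Shipping Cost': [1.0, None, 0.5],
})

# pipe() passes the DataFrame as the first argument of each function;
# any extra positional or keyword arguments are forwarded unchanged.
cleaned = (
    raw
    .pipe(standardize_columns)
    .pipe(drop_all_empty_rows)
    .pipe(add_total, 'unit_price', 'shipping_cost', name='total_cost')
)

print(cleaned)
```

Each helper takes a DataFrame and returns a new one, which is what allows the steps to be reordered or reused without touching the others.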

Exploring Different PyJanitor Functions

Now that we have understood the main features of PyJanitor, let’s dive deep into some other main functions....

Conclusion

In conclusion, PyJanitor is a useful library for data cleaning in Python. Its functions make common cleaning steps simple and fast, and because they can be chained, a series of operations reads as a single sequence, which improves the readability of the code. PyJanitor provides not only basic data cleaning operations but also functions for more complex ones. So the next time you need to clean data in a project, give PyJanitor a try....