Data Wrangling by Removing Duplication

The pandas duplicated() method helps us find duplicate values in a large data set so that they can be removed. Removing duplicate values is an important part of data wrangling.

Syntax: DataFrame.duplicated(subset=None, keep='first')

Here, subset is the column label (or list of column labels) in which to look for duplicate values. By default, all columns are compared.
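As a minimal sketch of how subset changes what counts as a duplicate, the snippet below uses a small hypothetical three-row frame (not the student data set used later in this article) and compares the result with and without subset.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration
df = pd.DataFrame({
    'Name': ['Amit', 'Amit', 'Rahul'],
    'Roll_no': [23, 34, 36],
})

# No subset: whole rows are compared, and no two rows match completely
all_columns = df.duplicated().tolist()

# subset='Name': only the Name column is compared
by_name = df.duplicated(subset='Name').tolist()

# subset as a list: rows must match on every listed column
by_name_and_roll = df.duplicated(subset=['Name', 'Roll_no']).tolist()

print(all_columns)       # [False, False, False]
print(by_name)           # [False, True, False]
print(by_name_and_roll)  # [False, False, False]
```

The second 'Amit' is flagged only when Name alone is compared, because his Roll_no differs from the first one.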

The keep parameter has 3 options. Note that duplicated() itself only marks rows; the marked rows are removed when the result is used to filter the DataFrame.

  • if keep='first', the first occurrence is treated as the original and all later occurrences are marked as duplicates.
  • if keep='last', the last occurrence is treated as the original and all earlier occurrences are marked as duplicates.
  • if keep=False (the Boolean, not the string 'false'), every value that occurs more than once is marked as a duplicate, so none of its occurrences is kept.
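The three keep options can be compared side by side. The sketch below assumes a hypothetical one-column frame in which roll number 23 appears three times, and prints the Boolean mask that duplicated() returns for each option (True means "marked as duplicate").

```python
import pandas as pd

# Hypothetical data: roll number 23 appears at positions 0, 2 and 4
roll_no = pd.DataFrame({'Roll_no': [23, 54, 23, 36, 23]})

# keep='first': the first 23 is the original, later ones are duplicates
first = roll_no.duplicated(keep='first').tolist()

# keep='last': the last 23 is the original, earlier ones are duplicates
last = roll_no.duplicated(keep='last').tolist()

# keep=False: every 23 is marked, because the value occurs more than once
none_kept = roll_no.duplicated(keep=False).tolist()

print(first)      # [False, False, True, False, True]
print(last)       # [True, False, True, False, False]
print(none_kept)  # [True, False, True, False, True]
```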

For example, suppose a university is organizing an event. To participate, students have to fill in their details in an online form so that the organizers can contact them. A student may fill out the form multiple times, and multiple entries from a single student make things difficult for the event organizers. The data that the organizers receive can be easily wrangled by removing the duplicate values.

Creating a dataset of students who want to participate in the event:

Python3




# Import module
import pandas as pd
 
# Initializing Data
student_data = {'Name': ['Amit', 'Praveen', 'Jagroop',
                         'Rahul', 'Vishal', 'Suraj',
                         'Rishab', 'Satyapal', 'Amit',
                         'Rahul', 'Praveen', 'Amit'],
 
                'Roll_no': [23, 54, 29, 36, 59, 38,
                            12, 45, 34, 36, 54, 23],
 
                'Email': ['xxxx@gmail.com', 'xxxxxx@gmail.com',
                          'xxxxxx@gmail.com', 'xx@gmail.com',
                          'xxxx@gmail.com', 'xxxxx@gmail.com',
                          'xxxxx@gmail.com', 'xxxxx@gmail.com',
                          'xxxxx@gmail.com', 'xxxxxx@gmail.com',
                          'xxxxxxxxxx@gmail.com', 'xxxxxxxxxx@gmail.com']}
 
# Creating Dataframe of Data
df = pd.DataFrame(student_data)
 
# Printing Dataframe
print(df)


Output:

(Image: dataset of students who want to participate in the event)

Removing Duplicate data from the Dataset using Data wrangling:

Python3




# import module
import pandas as pd
 
# initializing Data
student_data = {'Name': ['Amit', 'Praveen', 'Jagroop',
                         'Rahul', 'Vishal', 'Suraj',
                         'Rishab', 'Satyapal', 'Amit',
                         'Rahul', 'Praveen', 'Amit'],
 
                'Roll_no': [23, 54, 29, 36, 59, 38,
                            12, 45, 34, 36, 54, 23],
                'Email': ['xxxx@gmail.com', 'xxxxxx@gmail.com',
                          'xxxxxx@gmail.com', 'xx@gmail.com',
                          'xxxx@gmail.com', 'xxxxx@gmail.com',
                          'xxxxx@gmail.com', 'xxxxx@gmail.com',
                          'xxxxx@gmail.com', 'xxxxxx@gmail.com',
                          'xxxxxxxxxx@gmail.com', 'xxxxxxxxxx@gmail.com']}
 
# creating dataframe
df = pd.DataFrame(student_data)
 
# df.duplicated('Roll_no') marks rows whose Roll_no value has already appeared.
# The ~ (NOT) operator inverts that mask so that only non-duplicate rows are kept.
non_duplicate = df[~df.duplicated('Roll_no')]
 
# printing non-duplicate values
print(non_duplicate)


Output:

(Image: dataset after removing duplicate data using data wrangling)
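pandas also provides DataFrame.drop_duplicates(), which combines the marking and filtering steps into one call. The sketch below uses a shortened, hypothetical version of the student data to check that it gives the same rows as the ~df.duplicated() filter; the final reset_index() call is an optional extra step, not part of the example above.

```python
import pandas as pd

# Shortened, hypothetical version of the student data
df = pd.DataFrame({
    'Name': ['Amit', 'Praveen', 'Rahul', 'Rahul', 'Praveen', 'Amit'],
    'Roll_no': [23, 54, 36, 36, 54, 23],
})

# Approach used in this article: invert the duplicate mask and filter
masked = df[~df.duplicated('Roll_no')]

# One-call alternative: drop rows whose Roll_no has already appeared
dropped = df.drop_duplicates(subset='Roll_no')

print(masked.equals(dropped))  # True

# Optional: renumber the surviving rows 0, 1, 2, ...
clean = dropped.reset_index(drop=True)
print(clean)
```

Both approaches accept the same keep argument, so keep='last' or keep=False can be passed to drop_duplicates() as well.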

Creating New Datasets Using the Concatenation of Two Datasets in Data Wrangling

We can join two DataFrames in several ways. In our example of concatenating two datasets, we use the pd.concat() function.

Creating two DataFrames for concatenation:

Python3




# importing pandas module
import pandas as pd
   
# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd'],
        'Mobile No': [97, 91, 58, 76]}
     
# Define a dictionary containing employee data
data2 = {'Name':['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
        'Age':[22, 32, 12, 52],
        'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
        'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons'],
        'Salary':[1000, 2000, 3000, 4000]}
   
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data1,index=[0, 1, 2, 3])
   
# Convert the dictionary into DataFrame 
df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])


We will join these two DataFrames along axis 0, that is, row-wise.

Python3




res = pd.concat([df, df1])
print(res)


Output:

     Name  Age    Address Qualification  Mobile No  Salary
0     Jai   27     Nagpur           Msc       97.0     NaN
1  Princi   24     Kanpur            MA       91.0     NaN
2  Gaurav   22  Allahabad           MCA       58.0     NaN
3    Anuj   32    Kannuaj           Phd       76.0     NaN
2  Gaurav   22  Allahabad           MCA        NaN  1000.0
3    Anuj   32    Kannuaj           Phd        NaN  2000.0
6  Dhiraj   12  Allahabad          Bcom        NaN  3000.0
7  Hitesh   52    Kannuaj        B.hons        NaN  4000.0

Note: data1 does not have a Salary column, so the Salary values in the four rows of res that come from data1 are NaN. Likewise, data2 does not have a Mobile No column, so Mobile No is NaN in its four rows. Because pd.concat() keeps the original index labels, the labels 2 and 3 appear twice in res.
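By default, pd.concat() keeps each frame's original index labels and takes the union of the columns. The sketch below, which uses two small hypothetical frames rather than data1 and data2, shows two optional arguments that change this: ignore_index=True renumbers the rows, and join='inner' keeps only the columns that both frames share.

```python
import pandas as pd

# Two small hypothetical frames with one shared column and overlapping labels
left = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Mobile No': [97, 91]},
                    index=[0, 1])
right = pd.DataFrame({'Name': ['Gaurav', 'Anuj'], 'Salary': [1000, 2000]},
                     index=[1, 2])

# Default: original labels are kept, missing columns are filled with NaN
outer = pd.concat([left, right])

# ignore_index=True: rows are renumbered 0..n-1
renumbered = pd.concat([left, right], ignore_index=True)

# join='inner': only columns present in both frames survive
inner = pd.concat([left, right], join='inner')

print(outer.index.tolist())             # [0, 1, 1, 2]
print(renumbered.index.tolist())        # [0, 1, 2, 3]
print(inner.columns.tolist())           # ['Name']
print(outer['Salary'].isna().tolist())  # [True, True, False, False]
```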



