When you gather a dataset for building a machine learning model, you will often find rows or columns that repeat the same values. Removing these duplicates matters because they can distort summary statistics and reduce the accuracy of a model. In this how-to tutorial, you will learn how to remove duplicates from a dataset using the pandas library. Here is what the tutorial covers:
- How to create a DataFrame for demonstration purposes
- How to check for duplicate values in the dataset
- How to remove duplicates from the dataset
How to create a data frame?
Before removing duplicates, let's create a dataset that contains some. Import the pandas library to create the data frame.
import pandas as pd
from pandas import DataFrame
Create a DataFrame object.
data_obj = DataFrame({
"col1":[1,1,3,4,5,2,2],
"col2":["a","a","a","b","c","d","d"],
"col3":["A","A","B","C","B","D","D"],
})
It has 3 columns and 7 rows. A real-life dataset will usually be much larger and may contain many duplicates.
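As a quick sanity check, the size of the frame can be confirmed with the shape attribute. The sketch below rebuilds the same demonstration data so that it runs on its own; the names n_rows and n_cols are only illustrative.

```python
import pandas as pd

# Same demonstration data as above, rebuilt so this snippet is self-contained.
data_obj = pd.DataFrame({
    "col1": [1, 1, 3, 4, 5, 2, 2],
    "col2": ["a", "a", "a", "b", "c", "d", "d"],
    "col3": ["A", "A", "B", "C", "B", "D", "D"],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = data_obj.shape
print(n_rows, n_cols)  # 7 3

# Printing the frame makes the repeated rows easy to spot by eye.
print(data_obj)
```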
Check for duplicate values in the dataset
Once you have a dataset, you need to check whether it contains duplicates. The pandas library provides the duplicated() method for this. It returns a boolean Series with one entry per row: True if the row repeats an earlier row, False otherwise. For the data frame created above, the code is the following.
data_obj.duplicated()
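To see what duplicated() returns for this data, here is a small sketch. It also tries the keep parameter, which controls which copy of a repeated row is left unmarked. The variable names mask_first, mask_last and mask_all are illustrative only.

```python
import pandas as pd

data_obj = pd.DataFrame({
    "col1": [1, 1, 3, 4, 5, 2, 2],
    "col2": ["a", "a", "a", "b", "c", "d", "d"],
    "col3": ["A", "A", "B", "C", "B", "D", "D"],
})

# Default keep="first": the first occurrence is False, later repeats are True.
mask_first = data_obj.duplicated()
# keep="last" marks the earlier copies instead of the later ones.
mask_last = data_obj.duplicated(keep="last")
# keep=False marks every row that has a copy anywhere in the frame.
mask_all = data_obj.duplicated(keep=False)

print(mask_first.tolist())  # [False, True, False, False, False, False, True]
print(mask_last.tolist())   # [True, False, False, False, False, True, False]
print(mask_all.tolist())    # [True, True, False, False, False, True, True]

# Boolean indexing shows the affected rows themselves.
print(data_obj[mask_all])
```

Using keep=False is handy when you want to look at all copies of a repeated row before deciding which one to keep.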
If you chain the sum() method onto it, the True values are counted, which gives the total number of duplicate rows in the dataset (first occurrences are not counted). For the data frame above, the result is 2.
data_obj.duplicated().sum()
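The same counting idea can be applied to one column at a time by passing subset to duplicated(). The following is a minimal sketch; total and per_column are hypothetical names chosen for the example.

```python
import pandas as pd

data_obj = pd.DataFrame({
    "col1": [1, 1, 3, 4, 5, 2, 2],
    "col2": ["a", "a", "a", "b", "c", "d", "d"],
    "col3": ["A", "A", "B", "C", "B", "D", "D"],
})

# Number of rows that repeat an earlier row across all columns.
total = int(data_obj.duplicated().sum())
print(total)  # 2

# Number of repeated values in each column, considered on its own.
per_column = {
    col: int(data_obj.duplicated(subset=[col]).sum())
    for col in data_obj.columns
}
print(per_column)  # {'col1': 2, 'col2': 3, 'col3': 3}
```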
How to remove the duplicates from the dataset?
Now that you know the dataset contains duplicates, you can remove them. There are two ways to do it. One is to drop rows that are identical in every column; the other is to drop rows that repeat values in a chosen subset of columns.
Method 1: Remove entire duplicate rows
To remove rows whose values are identical in every column, use the drop_duplicates() method.
data_obj.drop_duplicates()
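It drops every repeated row, keeps the first occurrence of each, and returns a new DataFrame containing only unique rows. The original data_obj is not modified unless you assign the result back to it.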
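A short sketch of the effect on the demonstration data follows. Storing the result in a variable called unique_rows and resetting the index afterwards are choices made for this example, not requirements.

```python
import pandas as pd

data_obj = pd.DataFrame({
    "col1": [1, 1, 3, 4, 5, 2, 2],
    "col2": ["a", "a", "a", "b", "c", "d", "d"],
    "col3": ["A", "A", "B", "C", "B", "D", "D"],
})

# drop_duplicates() returns a new frame; data_obj keeps all 7 rows.
unique_rows = data_obj.drop_duplicates()
print(len(data_obj))     # 7
print(len(unique_rows))  # 5

# The surviving rows keep their original index labels, so there are gaps.
print(unique_rows.index.tolist())  # [0, 2, 3, 4, 5]

# Optionally renumber the rows from 0 after dropping.
unique_rows = unique_rows.reset_index(drop=True)
print(unique_rows.index.tolist())  # [0, 1, 2, 3, 4]
```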
Method 2: Remove rows that repeat values in selected columns
In this method, instead of comparing entire rows, you pass a list of columns and pandas compares only those. Rows that repeat an earlier row's values in the listed columns are dropped. The columns themselves are kept.
DataFrame.drop_duplicates(subset=column_list)
For example, col3 contains repeated values ("A", "B" and "D" each appear twice), so passing it keeps only the first row for each distinct col3 value.
data_obj.drop_duplicates(["col3"])
Conclusion
Preparing a dataset before designing a machine learning model is an important task for a data scientist. If the dataset contains many duplicates, a model trained on it can be biased or less accurate. That is why you should know how to remove duplicates from a dataset.
We hope this how-to tutorial on “Remove Duplicates from Data” helps you understand how to clean a dataset. If you liked this tutorial, leave us a comment and share it with your data science friends so that they can learn from it too. In the meantime, don't forget to subscribe.
Thanks
Data Science Learner Team