With a large dataset, you will sometimes find that your machine learning models are less accurate than you expected. There can be several reasons for this, such as duplicate values. Another common reason is outliers. These are values that don’t contribute to the prediction but distort descriptive statistics such as the mean and, to a lesser extent, the median. In this “How to” tutorial, you will learn how to find and handle outliers and do outlier analysis on multivariate data (data with more than one variable or feature). You will learn:
- How to handle outliers using the Box Plot Method?
- Finding the outliers using the Scatter Plot Matrices.
Before detecting the outliers, import all the necessary libraries. I am writing all the code in a Jupyter notebook, so use the same setup if you want to follow along step by step.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sb

%matplotlib inline
rcParams["figure.figsize"] = 10, 6
About the Dataset.
For demonstration purposes, I am using the Iris dataset. It has 5 columns: the first 4 are the variables (features) and the last column (species) is the target. The columns are sepal length, sepal width, petal length, petal width, and species.
Let's read the dataset and define the data and the target.
iris_data = pd.read_csv("data/iris.data.csv", header=None, sep=",")
iris_data.columns = ["sepal length", "sepal width", "petal length", "petal width", "species"]
data = iris_data.iloc[:, 0:4].values  # read the values of the first 4 columns
target = iris_data.iloc[:, 4].values  # read the values of the last column
iris_data[:5]
The lines that assign data and target select the features and the label. The data holds the values of the four columns sepal length, sepal width, petal length, and petal width, and the target holds the species column.
How to handle outliers using the Box Plot Method?
The box plot is built on the interquartile range (IQR), which is used to find the outliers in the dataset. I will not go into the details here. To read more about it, check the Measurement of Dispersion post, which covers how to find the interquartile range and the fences.
Visualization is often the best way to understand data. To see the outliers in the Iris dataset, use the following code.
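As a rough illustration of the fence idea, here is a minimal sketch. It assumes the conventional 1.5 × IQR multiplier, which the text above does not state, and the helper name iqr_fences and the sample values are hypothetical, not taken from the Iris file.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences for a 1-D sequence of numbers.

    k=1.5 is the usual box-plot convention (an assumption here).
    """
    q1, q3 = np.percentile(values, [25, 75])  # first and third quartiles
    iqr = q3 - q1                             # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample values, for illustration only
sample = np.array([4.9, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 7.9])
lower, upper = iqr_fences(sample)

# Anything outside the fences is treated as an outlier
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper, outliers)
```

Values outside the two fences are the ones a box plot draws as separate points.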
sb.boxplot(x="species",y ="sepal length",data=iris_data,palette="hls")
The x-axis shows the species and the y-axis shows the sepal length. In this plot, the species virginica shows an outlier in sepal length. You can clearly see the dot plotted apart from the whiskers for virginica.
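The box plot shows outliers visually. If you also want the matching rows, the same per-species check can be sketched in pandas. This is only an illustration under assumptions: the small frame df, the helper flag_outliers, and the 1.5 multiplier are hypothetical and not part of the original code, and pandas computes quartiles slightly differently from some plotting defaults.

```python
import pandas as pd

# Hypothetical miniature frame that reuses the tutorial's column names
df = pd.DataFrame({
    "species": ["setosa"] * 6 + ["virginica"] * 6,
    "sepal length": [5.0, 5.1, 4.9, 5.0, 5.2, 5.1,
                     6.5, 6.4, 6.6, 6.5, 6.7, 4.9],
})

def flag_outliers(group, k=1.5):
    """True where a value lies outside its own group's IQR fences."""
    q1, q3 = group.quantile(0.25), group.quantile(0.75)
    iqr = q3 - q1
    return (group < q1 - k * iqr) | (group > q3 + k * iqr)

# Apply the check separately within each species
mask = (
    df.groupby("species")["sepal length"]
      .transform(flag_outliers)
      .astype(bool)
)
print(df[mask])  # rows flagged as outliers within their species
```

With the real data you would pass iris_data in place of df.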
Finding the outliers using the Scatter Plot Matrices
In the previous section, we drew a box plot with Seaborn's boxplot(). Here, I will use Seaborn's pairplot() to find the outliers with a scatter plot matrix. The following figure shows the pair plot colored by species.
Inside the pairplot() method, you pass the data frame (iris_data) as the first argument, hue (species) to specify the column used for labeling, and the palette “hls”. In the figure above, you can see an odd red point that doesn’t fit any of the clusters. It belongs to the species setosa. Note that point and remove the record from the Excel file. Here the record is at cell 41; delete it.
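If you would rather not edit the file by hand, the same removal can be sketched in pandas. This is an assumption-based example: the frame demo is a hypothetical stand-in, and the label 41 is only a placeholder, because spreadsheet row numbers and pandas row labels can differ by one. Check the row's values before dropping it.

```python
import pandas as pd

# Hypothetical stand-in for a slice of iris_data around the suspicious row
demo = pd.DataFrame(
    {
        "sepal length": [5.1, 4.9, 4.5, 5.0],
        "sepal width": [3.5, 3.0, 2.3, 3.6],
        "species": ["setosa"] * 4,
    },
    index=[39, 40, 41, 42],
)

outlier_label = 41  # placeholder: confirm which label holds the odd point

# Drop the row by label and renumber the remaining rows from 0
cleaned = demo.drop(index=outlier_label).reset_index(drop=True)
print(cleaned)
```

Dropping the row in code leaves the source file untouched, so the cleaning step can be repeated or undone.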
Finding outliers is an important task in data pre-processing. If there are outliers, your machine learning predictions may be less accurate. Therefore, if you have a large dataset, always make sure that the percentage of outliers is less than 5%.
Hope this tutorial has given you a clear understanding of how to handle outliers in multivariate data. If you have any questions about dealing with data, please contact us. You can also like our page for more “How to” tutorials.
Data Science Learner Team