DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is an unsupervised clustering algorithm that is widely used in data mining and machine learning. Detecting and removing outliers is often a necessary step before processing a dataset. In this “How to” tutorial, you will learn how to detect outliers using the DBSCAN method.
DBSCAN considers two factors when detecting outliers: a distance measure (Euclidean distance or some other metric) and a minimum number of points. Based on how close the points are to one another, it separates the data points into two kinds of regions:
- High-density region
- Low-density region
The outliers are generally in the low-density region, which makes it easy to find them and remove them from the dataset.
The DBSCAN method uses two important parameters for clustering and assigns the label -1 to points that do not belong to any cluster.
eps: It defines the maximum distance between two samples for them to be considered part of the same neighborhood. As a starting point you can try an eps value of 0.1 and then adjust it, because a suitable value depends on the scale of your data.
min_samples: It is the minimum number of samples in a neighborhood for a data point to qualify as a core point. You should start with a low value.
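To make the two parameters concrete, here is a minimal NumPy sketch of the neighborhood test they describe. The toy points and the values of `eps` and `min_samples` are made up for illustration, and the snippet only checks which points would qualify as core points; it is not a full DBSCAN implementation.

```python
import numpy as np

# Toy 2-D data (made up for illustration): a tight group plus one far-away point.
points = np.array([
    [1.0, 1.0],
    [1.1, 1.0],
    [1.0, 1.1],
    [1.1, 1.1],
    [5.0, 5.0],   # isolated point
])

eps = 0.5         # neighborhood radius (hypothetical value for this toy data)
min_samples = 3   # points required in a neighborhood, counting the point itself

# Pairwise Euclidean distances between all points.
diffs = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# How many points fall inside each point's eps-neighborhood (itself included).
neighbor_counts = (distances <= eps).sum(axis=1)

# A point is a core point when its neighborhood holds at least min_samples points.
is_core = neighbor_counts >= min_samples

print(neighbor_counts)
print(is_core)
```

In this toy data the four nearby points each have four neighbors and qualify as core points, while the isolated point has only itself in its neighborhood. Because it is not within eps of any core point either, DBSCAN would treat it as noise.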
Let’s find the outliers using the DBSCAN implementation in scikit-learn.
Step 1: Import the necessary libraries for the DBSCAN method
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sb
import sklearn
from sklearn.cluster import DBSCAN
from collections import Counter
Step 2: Define the standard parameters for the data visualization
%matplotlib inline
rcParams["figure.figsize"] = 10, 6
Step 3: Read the dataset from the CSV file
In this step, you could also load the dataset from the built-in scikit-learn datasets. For demonstration purposes, I am using a copy of the Iris dataset that is already downloaded as a CSV file. Use the following code.
iris_data = pd.read_csv("data/iris.data.csv", header=None, sep=",")
iris_data.columns = ["sepal length", "sepal width", "petal length", "petal width", "species"]
data = iris_data.iloc[:, 0:4].values
target = iris_data.iloc[:, 4].values
Here, data holds the first four columns of the Iris dataset (sepal length, sepal width, petal length and petal width), and target is the species column.
Step 4: Model the DBSCAN
In this step, you will build the DBSCAN model using the eps and min_samples parameters and fit it to the dataset. To do this, use the following code.
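A minimal sketch of this step, using the eps and min_samples values this tutorial works with. As an assumption, `load_iris()` from scikit-learn stands in for the CSV file so that the snippet runs on its own; in the tutorial, `data` is the array built in Step 3, and the bundled copy of Iris may differ slightly from a downloaded CSV.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

# Assumption: load_iris() replaces the CSV so this snippet is self-contained.
data = load_iris().data

# eps and min_samples are the values used in this tutorial.
model = DBSCAN(eps=0.8, min_samples=19).fit(data)

print(model)                # the estimator and its parameters
print(model.labels_[:10])   # one cluster label per row; -1 marks noise
```

After fitting, `model.labels_` holds one label per row of `data`, which is what the later steps rely on.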
You can see the model output. It lists the arguments used for modeling the dataset: for example, the algorithm is set to auto, and the metric used to measure the distance between two points is Euclidean.
You can use any values here, but I am using an eps of 0.8, which is the maximum distance between two samples, and a min_samples of 19, which is the number of samples required in a neighborhood.
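Since the choice of eps is left open, one way to get a feel for it is to fit the model with a few candidate values and compare how many points end up labeled as noise. The sketch below is only an illustration: the candidate values are hypothetical, and `load_iris()` is again assumed as a stand-in for the CSV data.

```python
from collections import Counter

from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

# Assumption: load_iris() replaces the tutorial's CSV data.
data = load_iris().data

noise_counts = []
for eps in [0.4, 0.6, 0.8, 1.0]:   # hypothetical candidate values
    labels = DBSCAN(eps=eps, min_samples=19).fit(data).labels_
    n_noise = int((labels == -1).sum())   # points not assigned to any cluster
    noise_counts.append(n_noise)
    print(f"eps={eps}: {n_noise} noise points, labels {dict(Counter(labels.tolist()))}")
```

With min_samples held fixed, a larger eps can only keep or reduce the number of noise points, so the printed counts should not increase from one line to the next.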
Step 5: Visualize the Results
Before visualizing the outliers, you have to convert the data values into a data frame and use the model to detect the outliers.
outlier_df = pd.DataFrame(data)
print(Counter(model.labels_))
Counter() returns a dictionary-like object that maps each label to the number of data points assigned to it. Records labeled -1 are the outliers. Of the 150 observations in total, 94 points are in the cluster labeled 1 (the dense region), 50 are in the cluster labeled 0 (the sparse region), and 6 are outliers (-1).
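A small sketch of how Counter behaves on a list of labels. The labels here are made up for illustration; in the tutorial they come from `model.labels_`.

```python
from collections import Counter

# Made-up labels for illustration; real ones come from model.labels_.
labels = [0, 0, 1, 1, 1, -1, 0, -1]

counts = Counter(labels)
n_outliers = counts[-1]   # -1 is the noise label; a missing key would count as 0

print(counts)
print(n_outliers)
```

Looking up `counts[-1]` gives the number of outliers directly, and it returns 0 rather than raising an error when no point was labeled as noise.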
To print all the outliers as a table, filter the data frame on the model labels.
print(outlier_df[model.labels_ == -1])
It will print all the outliers in the dataset according to your defined model.
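The same kind of boolean mask can be inverted to drop the outliers instead of printing them. The sketch below uses a made-up data frame and made-up labels; in the tutorial, the equivalent inputs would be `outlier_df` and `model.labels_`, and the name `cleaned_df` is only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows and labels for illustration only.
df = pd.DataFrame({"a": [1.0, 1.1, 9.0, 1.2], "b": [2.0, 2.1, 9.5, 2.2]})
labels = np.array([0, 0, -1, 0])

# Keep only the rows that were assigned to a cluster (label other than -1).
cleaned_df = df[labels != -1].reset_index(drop=True)

print(cleaned_df)
```

Resetting the index is optional; it simply renumbers the remaining rows after the outliers are removed.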
To plot all the outliers, use the following code. It will draw a scatter plot.
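fig = plt.figure()
ax = fig.add_axes([.1, .1, 1, 1])
# Color each point by its cluster label (-1 is noise)
colors = model.labels_
# x-axis: petal length (column 2), y-axis: sepal width (column 1)
ax.scatter(x=data[:, 2], y=data[:, 1], c=colors, s=120)
ax.set_xlabel("Petal Length")
ax.set_ylabel("Sepal Width")
plt.title("DBSCAN for Outlier Detection")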
In the above figure, you can see the 50 points in the sparse region (a greenish color) and the points in the dense region (yellow). You can also clearly see the 6 violet dots that are the outliers. They belong to neither the dense region nor the sparse region.
Outliers generally do not improve the accuracy of a machine learning model; in fact, they can distort its predictions. It is therefore important to detect and remove them. DBSCAN is one of the popular ways of dividing a dataset into dense and sparse regions, which makes the outliers easy to identify. There are also other ways of detecting outliers, which you can read about on our site.
We hope this tutorial helped you detect and remove outliers from your dataset. If you liked it and want to give any feedback, contact us; you can also like our page for exclusive new “How to” tutorials.
Data Science Learner Team