Hierarchical clustering is an unsupervised machine learning algorithm used to label a dataset. Labeling here means grouping data points that share similar characteristics into clusters, which lets you discover subgroups in the data. In this how-to tutorial, you will learn how to do hierarchical clustering in Python.
Before moving on to the code, it helps to know a few terms. What follows is just a brief summary.
What is Hierarchical Clustering?
Hierarchical clustering takes a distance-based approach: data points are grouped according to how close they are to one another. There are two ways to do it. Agglomerative clustering works bottom-up, starting with every point as its own cluster and repeatedly merging the closest pair of clusters. Divisive clustering works top-down, starting with one cluster and repeatedly splitting it. In this tutorial, I will use the more popular agglomerative approach.
To find the number of subgroups in the dataset, you use a dendrogram, a tree diagram that shows how the data points are linked and how closely they are related.
This type of clustering has many use cases, including DNA sequencing, sentiment analysis, and tracking viral diseases. Other popular use cases are hospital resource management, business process management, and social network analysis.
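To make the bottom-up idea concrete, here is a minimal sketch that runs SciPy's linkage on four made-up one-dimensional points and prints each merge. The points and the choice of complete linkage are assumptions picked so the merges are easy to check by hand; they are not part of the Iris example used later.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four hypothetical 1-D points: a close pair near 0 and a looser pair near 6.
points = np.array([[0.0], [1.0], [5.0], [7.0]])

# Complete linkage: the distance between two clusters is the largest
# pairwise distance between their members.
merges = linkage(points, method="complete")

# Each row describes one merge: [cluster_a, cluster_b, distance, new_cluster_size].
# The original points have ids 0..3; clusters created by merges get ids 4, 5, ...
for step, (a, b, dist, size) in enumerate(merges, start=1):
    print(f"step {step}: merge {int(a)} and {int(b)} "
          f"at distance {dist:.1f} (new size {int(size)})")
```

With n points there are always n - 1 merges, and the last one joins everything into a single cluster.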
Easy Steps to Do Hierarchical Clustering in Python
Step 1: Import the necessary Libraries for the Hierarchical Clustering
import numpy as np
import pandas as pd
import scipy
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sb
import sklearn
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
import sklearn.metrics as sm
from sklearn.preprocessing import scale
Here we are importing dendrogram, linkage, fcluster, and cophenet from the scipy.cluster.hierarchy module.
Step 2: Configure the Output and the Data Visualization
# Configure the output
np.set_printoptions(precision=4, suppress=True)
%matplotlib inline
rcParams["figure.figsize"] = 20, 10
sb.set_style("whitegrid")
The first line, np.set_printoptions(precision=4, suppress=True), tells NumPy to print floating-point values with up to 4 digits after the decimal point and to suppress scientific notation for very small numbers. I will not go into it in detail here. If you want to read more, see the NumPy documentation for numpy.set_printoptions.
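As a quick illustration of what these print options change, the sketch below formats the same array with and without them. It uses np.printoptions, the context-manager form of np.set_printoptions, so the global settings stay untouched. The array values are made up for the demonstration.

```python
import numpy as np

# Hypothetical values: one long fraction, one tiny number, one large number.
values = np.array([0.123456789, 1.5e-10, 1234.56789])

# Default formatting of the array.
default = str(values)

# Same options as in the tutorial, applied only inside the with-block.
with np.printoptions(precision=4, suppress=True):
    formatted = str(values)

print("default:  ", default)
print("formatted:", formatted)
```

With suppress=True the tiny value is printed as zero in fixed-point notation instead of in scientific notation.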
Step 3: Load the Dataset
You can use your own dataset, but I am using the built-in Iris dataset that ships with scikit-learn.
iris = datasets.load_iris()
# scale the data
data = scale(iris.data)
target = pd.DataFrame(iris.target)
variable_names = iris.feature_names
data[0:10]
Here data is the input variable (the scaled data) and target is the output variable. In this case, data contains the sepal length, sepal width, petal length, and petal width. The target is the Iris species.
Please note that you should scale the data before clustering. Hierarchical clustering relies on distances, so a feature measured on a larger scale would otherwise dominate the result.
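To see why scaling matters for a distance-based method, here is a small NumPy-only sketch. The three samples and their units are hypothetical. The standardization shown (zero mean and unit standard deviation per column) is what sklearn.preprocessing.scale does with its default settings.

```python
import numpy as np

# Hypothetical samples: column 0 is in metres, column 1 is in millimetres.
raw = np.array([
    [1.0, 1000.0],
    [2.0, 1010.0],
    [1.1, 1500.0],
])

# Squared difference per feature between sample 0 and sample 2.
raw_sq_diff = (raw[0] - raw[2]) ** 2
# Share of the squared Euclidean distance that comes from the millimetre column.
raw_share_col1 = raw_sq_diff[1] / raw_sq_diff.sum()

# Standardize each column: subtract the mean, divide by the standard deviation.
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

scaled_sq_diff = (standardized[0] - standardized[2]) ** 2
scaled_share_col1 = scaled_sq_diff[1] / scaled_sq_diff.sum()

print(f"raw:    millimetre column contributes {raw_share_col1:.4f} of the distance")
print(f"scaled: millimetre column contributes {scaled_share_col1:.4f} of the distance")
```

Before scaling, the distance is almost entirely decided by whichever column happens to use the larger numbers. After scaling, each column has the same spread, so no feature dominates purely because of its unit.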
Step 4: Draw the Dendrogram of the dataset.
To estimate the number of clusters, you should inspect the data visually. For this you will use the dendrogram. Use the following code.
z = linkage(data, "ward")

# generate dendrogram
dendrogram(z, truncate_mode="lastp", p=12, leaf_rotation=45, leaf_font_size=15, show_contracted=True)
plt.title("Truncated Hierarchical Clustering Dendrogram")
plt.xlabel("Cluster Size")
plt.ylabel("Distance")

# divide the cluster
plt.axhline(y=15)
plt.axhline(5)
plt.axhline(10)
plt.show()
Common ways to link clusters include Ward, complete, and average linkage. Here I am using Ward linkage, but only to visualize the dendrogram. Later, you will choose the distance metric and linkage method based on the accuracy score of the model.
In the dendrogram for the Iris dataset, a horizontal line drawn at a distance of approximately y = 9 crosses three branches, which divides the data into three clusters. The dataset also contains three species. This means you should choose k = 3 as the number of clusters.
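The cophenet and pdist functions imported in Step 1 give one way to compare linkage methods numerically. The sketch below computes the cophenetic correlation coefficient, which measures how well the merge heights in the tree preserve the original pairwise distances (values closer to 1 are better). The data here is a randomly generated stand-in, not the Iris data, and this comparison is an addition for illustration rather than a step from the tutorial.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Hypothetical stand-in for a scaled feature matrix: 3 blobs in 4 dimensions.
centers = np.array([[0, 0, 0, 0], [5, 5, 5, 5], [-5, 5, -5, 5]], dtype=float)
sample = np.vstack([c + rng.normal(scale=1.0, size=(20, 4)) for c in centers])

# Condensed vector of pairwise Euclidean distances between all points.
pairwise = pdist(sample)

coph_scores = {}
for method in ("ward", "complete", "average"):
    tree = linkage(sample, method=method)
    # cophenet returns (correlation, cophenetic distances); keep the correlation.
    corr, _ = cophenet(tree, pairwise)
    coph_scores[method] = corr
    print(f"{method:>8}: cophenetic correlation = {corr:.3f}")
```

To try this on the tutorial's data, you could pass data in place of sample.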
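Once you have decided on a number of clusters from the dendrogram, the fcluster function imported in Step 1 can cut the tree into that many flat clusters. This is a minimal sketch on three made-up, well-separated blobs. The data and the random seed are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(42)

# Three hypothetical 2-D blobs, 15 points each, far apart from one another.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([c + rng.normal(scale=0.5, size=(15, 2)) for c in centers])

tree = linkage(blobs, method="ward")

# criterion="maxclust" cuts the tree so that at most t flat clusters remain.
flat_labels = fcluster(tree, t=3, criterion="maxclust")

# fcluster numbers its clusters starting from 1.
cluster_ids, counts = np.unique(flat_labels, return_counts=True)
print("cluster ids:", cluster_ids)
print("sizes:      ", counts)
```

On the tutorial's data the equivalent call would be fcluster(z, t=3, criterion="maxclust").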
Step 5: Generate the Hierarchical cluster.
In this step, you will generate hierarchical clusters using various affinity and linkage methods. Each combination gives a different accuracy score, and you will choose the one with the highest score.
# based on the dendrogram we have three clusters
k = 3
# build the model
# note: scikit-learn 1.2 renamed the affinity parameter to metric, and 1.4 removed affinity
HClustering = AgglomerativeClustering(n_clusters=k, affinity="euclidean", linkage="ward")
# fit the model on the dataset
HClustering.fit(data)
# accuracy of the model
sm.accuracy_score(target, HClustering.labels_)
As you can see in the code, I am using agglomerative clustering with 3 clusters, Euclidean distance as the affinity, and Ward as the linkage. The scikit-learn metrics module, imported as sm, gives the accuracy score of the model. You can keep changing the affinity (Euclidean, Manhattan, cosine) and the linkage (Ward, complete, average) until you get the best accuracy score. Note that Ward linkage works only with Euclidean distance.
I got the highest accuracy score, 0.68, when using Euclidean as the affinity and average as the linkage. I will therefore choose that combination as the hierarchical clustering model for the Iris dataset.
Hierarchical clustering is a very good way to label an unlabeled dataset. However, standard hierarchical agglomerative clustering (HAC) has a time complexity of O(n^3), which makes it slow as the number of data points grows. The algorithm is therefore a good fit for small datasets, and you should avoid applying it to large ones.
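One caveat when scoring clusters with accuracy: the ids a clustering algorithm assigns are arbitrary, so a perfect grouping can still score poorly if cluster 0 happens to correspond to species 1. The toy example below shows this, and also shows a permutation-invariant alternative, the adjusted Rand index from scikit-learn. The labels are made up, and using the adjusted Rand index is a suggestion of this sketch, not the method used in the tutorial.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Hypothetical ground truth: two species, three samples each.
true_species = [0, 0, 0, 1, 1, 1]
# A clustering that finds exactly the same groups but with the ids swapped.
cluster_ids = [1, 1, 1, 0, 0, 0]

acc = accuracy_score(true_species, cluster_ids)
ari = adjusted_rand_score(true_species, cluster_ids)

print("accuracy:", acc)             # compares ids position by position
print("adjusted Rand index:", ari)  # compares groupings, ignoring the id names
```

Here accuracy is 0.0 while the adjusted Rand index is 1.0, even though the grouping itself is perfect.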
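The trial-and-error search over linkage methods can be written as a loop. The sketch below is one possible way to do it, under two assumptions of mine: the distance is left at scikit-learn's default (Euclidean), so the code does not depend on whether your version names the parameter affinity or metric, and the score is the adjusted Rand index rather than accuracy. The numbers it prints are therefore not directly comparable to the 0.68 accuracy above.

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import scale

iris = datasets.load_iris()
features = scale(iris.data)

ari_by_linkage = {}
for method in ("ward", "complete", "average"):
    # Distance is left at the default (Euclidean) for every linkage method.
    model = AgglomerativeClustering(n_clusters=3, linkage=method)
    labels = model.fit_predict(features)
    # Score the grouping against the known species, ignoring cluster id names.
    ari_by_linkage[method] = adjusted_rand_score(iris.target, labels)
    print(f"{method:>8}: adjusted Rand index = {ari_by_linkage[method]:.3f}")

best = max(ari_by_linkage, key=ari_by_linkage.get)
print("best linkage by this score:", best)
```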
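Part of the scaling problem is easy to see from the pairwise distances alone: n points have n(n-1)/2 pairs, so the distance vector grows quadratically. The sketch below checks that count with pdist on random data and estimates the memory needed for larger n. It illustrates the memory side only, not the O(n^3) time bound itself, and the sizes chosen are arbitrary examples.

```python
import numpy as np
from scipy.spatial.distance import pdist

def condensed_size(n_points: int) -> int:
    """Number of pairwise distances between n_points points."""
    return n_points * (n_points - 1) // 2

rng = np.random.default_rng(0)
# Random stand-in with the same shape as the Iris feature matrix.
small = rng.normal(size=(150, 4))
n_pairs_small = len(pdist(small))
print("pairs for 150 points:", n_pairs_small)

# Estimated memory for the distance vector at 8 bytes per float64 value.
for n in (150, 10_000, 100_000):
    megabytes = condensed_size(n) * 8 / 1e6
    print(f"n = {n:>7}: {condensed_size(n):>13} distances, about {megabytes:,.1f} MB")
```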
We hope you now understand the concepts of hierarchical clustering. If you have any questions, you can message us directly on our Facebook page. We are always ready to help. And finally, don't forget to subscribe.
Data Science Learner Team