Hierarchical Clustering is an unsupervised machine learning algorithm used to label a dataset. Labeling here means clustering the data points that share the same characteristics, which lets you discover the subgroups in the dataset. In this “How to” tutorial, you will learn how to do Hierarchical Clustering in Python.
Before going to the coding part, you should know a few terms that make Hierarchical Clustering in Python easier to understand. This is just a brief summary.
What is Hierarchical Clustering?
Hierarchical Clustering groups data points based on the distance between them: each data point is linked to its nearest neighbors. There are two ways to do it. Agglomerative clustering is a bottom-up approach that starts with every point as its own cluster and repeatedly merges the closest clusters. Divisive clustering is a top-down approach that starts with a single cluster and repeatedly splits it. In this tutorial, I will use the popular Agglomerative approach.
To find the number of subgroups in the dataset, you use a dendrogram, a tree graph that shows how the data points are linked and how closely they are related.
You will find many use cases for this type of clustering, and some of them are DNA sequencing, Sentiment Analysis, Tracking Virus Diseases, etc. Popular use cases are Hospital Resource Management, Business Process Management, and Social Network Analysis.
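To make the bottom-up idea concrete, here is a minimal sketch of agglomerative merging on a toy dataset. The function name naive_agglomerative and the toy points are made up for illustration, and it uses single linkage (closest pair of members) because that is the simplest rule to write down. It is not how scipy or sklearn implement clustering.

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Hypothetical helper: merge the two closest clusters until n_clusters remain."""
    # start with every point in its own cluster (bottom-up)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # merge cluster b into cluster a and drop b
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# toy data: two tight pairs and one isolated point
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
groups = naive_agglomerative(toy_points, n_clusters=3)
print(groups)
```

The two tight pairs get merged first and the isolated point stays on its own, which is exactly the kind of merge order a dendrogram draws.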
Easy Steps to Do Hierarchical Clustering in Python
Step 1: Import the necessary Libraries for the Hierarchical Clustering
import numpy as np
import pandas as pd
import scipy
from scipy.cluster.hierarchy import dendrogram,linkage
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sb
import sklearn
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
import sklearn.metrics as sm
from sklearn.preprocessing import scale
Here we are importing dendrogram, linkage, fcluster, and cophenet from the scipy.cluster.hierarchy package.
Step 2: Configure the output for the Data Visualization
#Configure the output
np.set_printoptions(precision=4,suppress=True)
%matplotlib inline
rcParams["figure.figsize"] =20,10
sb.set_style("whitegrid")
The first line, np.set_printoptions(precision=4, suppress=True), tells NumPy to print floats with up to 4 digits after the decimal point and to suppress scientific notation for very small values. I am not going into it in detail. If you want to read more about it, see the documentation for numpy.set_printoptions.
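If you want to see what these print options do, the short sketch below may help. It uses the np.printoptions context manager, which is my choice here so the settings only apply inside the with block. The sample values are made up.

```python
import numpy as np

# made-up sample values: a long decimal, a tiny number, and a larger number
values = np.array([0.123456789, 1e-10, 123.456789])

# np.printoptions applies the same settings as np.set_printoptions, but temporarily
with np.printoptions(precision=4, suppress=True):
    formatted = str(values)

print(formatted)
```

Inside the block the values are rounded to 4 decimal places and the tiny number is printed as zero instead of in scientific notation.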
Step 3: Load the Dataset
You can use your own dataset, but I am using the default Iris dataset that ships with Sklearn.
iris = datasets.load_iris()
#scale the data
data = scale(iris.data)
target = pd.DataFrame(iris.target)
variable_names = iris.feature_names
data[0:10]
Here data is the input variable (scaled data) and target is the output variable. In this case, data is sepal length, sepal width, petal length, and petal width. The target is the Iris species type.
Please note that you should scale the data, because the clustering is distance based and a feature with a larger range would otherwise dominate the distances.
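As a quick sanity check, you can confirm that scaling did what you expect. The sketch below assumes the same Iris data as above and wraps the scaled array in a DataFrame (the names scaled, column_means, and column_stds are mine) to look at the mean and standard deviation of each feature.

```python
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import scale

iris = datasets.load_iris()
scaled = pd.DataFrame(scale(iris.data), columns=iris.feature_names)

# After standardization each column should have mean ~0 and standard deviation ~1.
# ddof=0 gives the population standard deviation, which is what scale() uses.
column_means = scaled.mean()
column_stds = scaled.std(ddof=0)

print(column_means.round(4))
print(column_stds.round(4))
```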
Step 4: Draw the Dendrogram of the dataset.
To estimate the number of clusters, you should verify it visually. In this case, you will use the dendrogram. Use the following code.
z = linkage(data,"ward")
#generate dendrogram
dendrogram(z,truncate_mode= "lastp", p =12, leaf_rotation=45,leaf_font_size=15, show_contracted=True)
plt.title("Truncated Hierarchical Clustering Dendrogram")
plt.xlabel("Cluster Size")
plt.ylabel("Distance")
#divide the cluster
plt.axhline(y=15)
plt.axhline(5)
plt.axhline(10)
plt.show()
There are three common ways to link the data points: Ward, Complete, and Average. Here I am using the ward linkage parameter, but only for visualizing the dendrogram. In the end, you will choose the distance metric and linkage parameter based on the accuracy score of the model.
Using the Iris dataset and its dendrogram, you can clearly see that a horizontal line at a distance of approximately y = 9 cuts the tree into three clusters. The dataset also has three species. It means you should choose k = 3 as the number of clusters.
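One possible way to compare the linkage methods before building the model is the cophenetic correlation, which uses the cophenet and pdist functions imported in Step 1. This is a sketch of my own, not part of the original workflow. A value closer to 1 suggests the tree preserves the original pairwise distances better. The dictionary name cophenetic_scores is an assumption.

```python
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn import datasets
from sklearn.preprocessing import scale

data = scale(datasets.load_iris().data)
pairwise = pdist(data)  # condensed matrix of Euclidean distances

cophenetic_scores = {}
for method in ("ward", "complete", "average"):
    z = linkage(data, method)
    # cophenet returns (correlation coefficient, cophenetic distances)
    corr, _ = cophenet(z, pairwise)
    cophenetic_scores[method] = corr

for method, corr in cophenetic_scores.items():
    print(f"{method}: {corr:.3f}")
```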
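You can also cut the tree in code instead of only reading it off the plot. The sketch below uses fcluster, which was imported in Step 1, and asks for at most three flat clusters. The variable names are mine, and this is just one possible way to cut the tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn import datasets
from sklearn.preprocessing import scale

data = scale(datasets.load_iris().data)
z = linkage(data, "ward")

# criterion="maxclust" asks for at most t flat clusters; labels start at 1
flat_labels = fcluster(z, t=3, criterion="maxclust")

# count how many points landed in each cluster
cluster_ids, cluster_sizes = np.unique(flat_labels, return_counts=True)
print(dict(zip(cluster_ids, cluster_sizes)))
```

If you prefer to cut at a height, as with the horizontal lines in the plot, fcluster also accepts criterion="distance" with t set to the cut height.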
Step 5: Generate the Hierarchical cluster.
In this step, you will generate a Hierarchical Cluster using various affinity and linkage methods. Doing this, you will get different accuracy scores. You will choose the method with the largest score.
#based on the dendrogram we have three clusters
k =3
#build the model (scikit-learn 1.2 and later rename the affinity parameter to metric)
HClustering = AgglomerativeClustering(n_clusters=k, affinity="euclidean", linkage="ward")
#fit the model on the dataset
HClustering.fit(data)
#accuracy of the model
sm.accuracy_score(target,HClustering.labels_)
You can see in the code that I am using Agglomerative Clustering with 3 clusters, Euclidean distance as the affinity, and ward as the linkage parameter. The sklearn.metrics module (imported as sm) gives the accuracy score of the model. You may keep changing the affinity (Euclidean, Manhattan, Cosine) and linkage (ward, complete, average) until you get the best accuracy score. Note that ward linkage only works with Euclidean distance, and that cluster IDs are arbitrary, so the accuracy score depends on how the cluster IDs happen to line up with the species IDs.
I get the highest accuracy score of 0.68 when I use Euclidean as the affinity and average as the linkage parameter. Thus I will choose this combination as the Hierarchical Clustering model for the Iris dataset.
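Because cluster IDs are arbitrary (cluster 0 is not guaranteed to be species 0), a plain accuracy score can understate how good the clustering is. One possible fix, sketched below, is to align cluster IDs with species IDs before scoring. It uses linear_sum_assignment from scipy.optimize, which is my addition and not part of the original code, and it relies on the default Euclidean distance so it does not depend on the affinity/metric parameter name.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import scale

iris = datasets.load_iris()
data = scale(iris.data)

# ward linkage with the default (Euclidean) distance
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(data)

# rows: true species, columns: cluster IDs
cm = confusion_matrix(iris.target, labels)

# pick the one-to-one cluster -> species mapping that maximizes the matches
species_idx, cluster_idx = linear_sum_assignment(cm, maximize=True)
mapping = {cluster: species for species, cluster in zip(species_idx, cluster_idx)}
aligned = np.array([mapping[c] for c in labels])

raw_accuracy = accuracy_score(iris.target, labels)
aligned_accuracy = accuracy_score(iris.target, aligned)
print(raw_accuracy, aligned_accuracy)
```

The aligned score can never be lower than the raw one, because leaving the IDs unchanged is one of the mappings being considered.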
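Instead of changing the parameters by hand, you could loop over the linkage options. The sketch below is one way to do it, with two assumptions of mine: it keeps the default Euclidean distance, and it scores with adjusted_rand_score, which does not care how the cluster IDs are numbered. The numbers it prints are therefore not comparable with the accuracy score above.

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import scale

iris = datasets.load_iris()
data = scale(iris.data)

ari_by_linkage = {}
for link in ("ward", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=3, linkage=link)
    predicted = model.fit_predict(data)
    # adjusted Rand index: 1.0 means the grouping matches the species exactly
    ari_by_linkage[link] = adjusted_rand_score(iris.target, predicted)

best_linkage = max(ari_by_linkage, key=ari_by_linkage.get)
print(ari_by_linkage)
print("best linkage:", best_linkage)
```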
Other Clustering Alternatives
Apart from the technique above, you may also choose the K-means clustering technique, which is suitable for large data as well.
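For comparison, a minimal K-means sketch on the same scaled Iris data could look like this. The n_init and random_state values are arbitrary choices of mine to make the run repeatable.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

data = scale(datasets.load_iris().data)

# n_init and random_state are set explicitly so repeated runs give the same result
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(data)

# one centroid per cluster, one coordinate per feature
print(kmeans.cluster_centers_.shape)
```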
Conclusion
Hierarchical Clustering is a very good way to label an unlabeled dataset. However, a naive implementation of hierarchical agglomerative clustering (HAC) has a time complexity of O(n^3), which makes it slow. Therefore, this algorithm is good for small datasets. Avoid applying it to large datasets.
We hope you now fully understand the concepts of Hierarchical Clustering. If you have any questions, you can message us directly on our Facebook Page. We are always ready to help you. And lastly, don’t forget to subscribe.
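To get a feel for why the cost grows so quickly, it may help to count the pairwise distances the algorithm starts from. The sketch below is only an illustration of that growth and not a benchmark. The larger sample sizes in the loop are made-up examples.

```python
from scipy.spatial.distance import pdist
from sklearn import datasets
from sklearn.preprocessing import scale

data = scale(datasets.load_iris().data)

n_samples = data.shape[0]
# pdist returns one distance per unordered pair of points: n * (n - 1) / 2 values
n_pairs = pdist(data).shape[0]
print(n_samples, n_pairs)

# the same formula for hypothetical larger datasets
for n in (1_000, 10_000, 100_000):
    print(n, n * (n - 1) // 2)
```

Even before any merging happens, the number of distances grows with the square of the number of points.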
Thanks
Data Science Learner Team