In some cases you have a dataset that is mostly unlabeled. The problems start when you want to structure the dataset and make it valuable by labeling it. Machine learning offers various methods for labeling such datasets, and clustering is one of them. In this “How to” tutorial, you will learn to do K Means Clustering in Python.
What is the K Means Clustering Algorithm?
K Means is a simple unsupervised clustering algorithm used to predict groups in an unlabeled dataset. Its predictions are based on two values:
1. The number of cluster centers (centroids, k).
2. The nearest mean value: each observation is assigned to the cluster whose mean (centroid) is closest to it.
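The two ingredients above can be sketched in a few lines of NumPy. This is only an illustrative toy, not the scikit-learn implementation used later in this tutorial; the function name kmeans_sketch, its parameters, and the sample points are all made up for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy k-means: alternate nearest-centroid assignment and mean updates."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every observation to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned observations;
        # keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs (made-up data for illustration)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

The first three points should end up in one cluster and the last three in the other. Which cluster gets the id 0 and which gets 1 depends on the random start.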
K Means Clustering has many popular use cases, including price and cost modeling of a specific market, fraud detection, and portfolio or hedge fund management.
Before going into the details and the coding part of K Means Clustering in Python, keep in mind that clustering should be done on scaled (standardized) variables. This means that each variable should have a mean of zero and a standard deviation of one. Also remember to use a scatter plot or the data table to estimate the number of centroids, or cluster centers (k).
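As a quick check of what scaling does, here is a minimal sketch using sklearn.preprocessing.scale on a small made-up matrix. The numbers are illustrative only and are not taken from the Iris data.

```python
import numpy as np
from sklearn.preprocessing import scale

# Made-up feature matrix: two columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_scaled = scale(X)  # standardize each column independently

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

Without this step, the second column would dominate the distance calculations simply because its values are larger.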
Step 1: Import the necessary libraries required for the K Means Clustering model
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pylab import rcParams
# sklearn
import sklearn
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale  # for scaling the data
import sklearn.metrics as sm  # for evaluating the model
from sklearn import datasets
from sklearn.metrics import confusion_matrix, classification_report
Step 2: Define the Parameters for the Visualization
%matplotlib inline
rcParams["figure.figsize"] = 20, 10
I am using a Jupyter notebook, so I call %matplotlib inline to show the figures inline.
Step 3: Load and scale the Dataset.
I am loading the Iris dataset that ships with sklearn. You can also use your own dataset, but for this demonstration I am using the built-in one.
iris = datasets.load_iris()
# scale the data
data = scale(iris.data)  # scale the iris data
target = pd.DataFrame(iris.target)  # define the target
variable_names = iris.feature_names
data[0:10]
Here, data is the scaled data and target holds the species of each observation.
Please note that data[0:10] returns a NumPy array, not a DataFrame.
Step 4: Build the Cluster Model and model the output
In this step, you will build the K Means cluster model and call the fit() method on the dataset. After that, you will prepare the output for data visualization.
clustering = KMeans(n_clusters=3, random_state=5)
# fit the dataset
clustering.fit(data)
iris_df = pd.DataFrame(iris.data)
iris_df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
target.columns = ["Target"]
The output above shows that the KMeans() estimator has been created and fitted. It lists the arguments the estimator was built with, such as the type of algorithm and the number of clusters (n_clusters). You can read more about these arguments in the scikit-learn documentation for K-Means clustering.
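If you want to look inside the fitted model, the following sketch prints its arguments and a few fitted attributes. It rebuilds the model so that it runs on its own. The explicit n_init=10 is my own assumption, added because the default for that argument has changed between scikit-learn versions; it is not part of the original code.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

data = scale(datasets.load_iris().data)

# n_init=10 is an assumption made for reproducibility across versions
clustering = KMeans(n_clusters=3, random_state=5, n_init=10).fit(data)

print(clustering.get_params())            # arguments the estimator was built with
print(clustering.cluster_centers_.shape)  # one centroid per cluster, one column per feature
print(clustering.labels_[:10])            # cluster id assigned to the first ten rows
print(clustering.inertia_)                # sum of squared distances to the nearest centroid
```

The labels_ attribute is the one used for coloring the scatter plots in the next step.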
Step 5: Plot the Model Output using Matplotlib
colors = np.array(["Red", "Green", "Blue"])
plt.subplot(1, 2, 1)
plt.scatter(x=iris_df["petal_length"], y=iris_df["petal_width"], c=colors[iris.target], s=50)
plt.title("Before K Means Classification")
plt.subplot(1, 2, 2)
plt.scatter(x=iris_df["petal_length"], y=iris_df["petal_width"], c=colors[clustering.labels_], s=50)
plt.title("K Means Classification")
Both figures suggest that the model has found the clusters fairly accurately. The only visible problem is that the cluster labels do not match the species labels, so the colors are swapped. To reassign the labels, we use the np.choose() method and change the label positions from [0,1,2] to [2,0,1]. The right mapping depends on how the clusters were numbered in your run, so compare the two plots before reusing [2,0,1]. The full code is given below.
relabel = np.choose(clustering.labels_, [2, 0, 1]).astype(np.int64)
colors = np.array(["Red", "Green", "Blue"])
plt.subplot(1, 2, 1)
plt.scatter(x=iris_df.petal_length, y=iris_df.petal_width, c=colors[iris.target], s=50)
plt.title("Before K Means Classification")
plt.subplot(1, 2, 2)
plt.scatter(x=iris_df.petal_length, y=iris_df.petal_width, c=colors[relabel], s=50)
plt.title("K Means Classification")
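To see what np.choose() does with a list such as [2,0,1], here is a small sketch on made-up labels. The array cluster_ids is hypothetical and only stands in for clustering.labels_.

```python
import numpy as np

# Hypothetical cluster ids, standing in for clustering.labels_
cluster_ids = np.array([0, 0, 1, 1, 2, 2, 1])

# np.choose picks choices[cluster_id] for every element:
# cluster 0 -> 2, cluster 1 -> 0, cluster 2 -> 1
relabel = np.choose(cluster_ids, [2, 0, 1]).astype(np.int64)
print(relabel)  # [2 2 0 0 1 1 0]

# The same lookup written as plain array indexing
same = np.array([2, 0, 1])[cluster_ids]
print(np.array_equal(relabel, same))  # True
```

In other words, the list passed to np.choose() is a lookup table from the cluster id to the label you want to display.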
Step 6: Evaluate the Accuracy of the Cluster Results
In the last step, you will verify the accuracy of the model's results. To do so, use the sklearn classification report.
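A minimal, self-contained sketch of this step could look like the following. It makes two assumptions that are not in the original code: it sets n_init=10 explicitly, and it derives the cluster-to-species mapping by majority vote instead of hard-coding it, because the cluster numbering can differ between runs and library versions.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report
from sklearn.preprocessing import scale

iris = datasets.load_iris()
clustering = KMeans(n_clusters=3, random_state=5, n_init=10).fit(scale(iris.data))

# Assumption: map each cluster id to the species it contains most often
mapping = [
    int(np.bincount(iris.target[clustering.labels_ == c]).argmax())
    for c in range(3)
]
relabel = np.choose(clustering.labels_, mapping)

# Per-class precision, recall and F1 score, plus the averages
print(classification_report(iris.target, relabel))
```

The exact figures in the report can vary slightly with the scikit-learn version, so treat them as indicative.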
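Before verifying the results, you should know the following terms.
Precision: measures the relevancy of the model, that is, the share of observations assigned to a class that really belong to it.
Recall: measures the completeness of the model, that is, the share of observations of a class that the model actually found.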
Highly Accurate Model Results = High Precision + High Recall
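The two terms are easier to follow on a tiny worked example. The labels below are made up for illustration and have nothing to do with the Iris results.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up binary labels: 1 is the class of interest
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

# True positives = 3, false positives = 2, false negatives = 1
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4

print(precision)  # 0.6
print(recall)     # 0.75
```

Here the model finds most of the positives (recall 0.75), but two of its five positive predictions are wrong (precision 0.6).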
In our case, the average precision is 83% and the average recall is 83% over the entire dataset. From these results, you can say that our model gives fairly accurate results.
The K Means clustering model is a popular way of clustering unlabeled datasets. In the real world, however, you will get large datasets that are mostly unstructured, and you will use machine learning algorithms to give them structure. There are also other types of clustering methods. The clustering algorithm you choose will depend on the dataset.
I hope you now understand the K Means Clustering algorithm. If you need any help from us, you can message us directly on the Data Science Learn Page. We are always ready to help you.
Data Science Learner Team