K Means Clustering in Python : Label the Unlabeled Data

Sometimes you have a dataset that is mostly unlabeled. The problem starts when you want to structure the dataset and make it valuable by labeling it. In machine learning, there are various methods for labeling such datasets, and clustering is one of them. In this “How to” tutorial, you will learn to do K Means Clustering in Python.

What is the K Means Clustering Algorithm?

K Means is a simple unsupervised clustering algorithm used to predict groups in an unlabeled dataset. In unsupervised machine learning, you don’t supervise the model with known answers. Instead, the model does its own work to find the patterns in the dataset and then automatically assigns a cluster label to each observation.

In K Means clustering, the predictions depend on two values:

1. The number of cluster centers (centroids), k.

2. The nearest mean: each observation is assigned to the cluster whose mean (centroid) is closest to it.
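To make these two ideas concrete, here is a minimal sketch of the K Means loop written with plain NumPy. It is only an illustration under simple assumptions (random starting centroids, Euclidean distance, made-up data), not the implementation that sklearn uses. The function name kmeans_sketch and its parameters are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative K Means loop; not sklearn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every observation to every centroid,
        # then pick the index of the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned observations
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignments are stable
        centroids = new_centroids
    return labels, centroids

# Two synthetic blobs of 50 points each (made-up data for the demonstration).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids)
```

The two steps inside the loop map directly to the two values above: k fixes how many centroids exist, and the nearest mean decides which cluster each observation joins.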

There are many popular use cases of K Means Clustering. Some of them are price and cost modeling of a specific market, fraud detection, and portfolio or hedge fund management.

Before going into the details and the coding part of K Means Clustering in Python, keep in mind that clustering should be done on scaled (standardized) variables. That means each variable should have a mean of zero and a variance of one. Otherwise, variables with large numeric ranges dominate the distance calculation. The other thing to remember is to use a scatter plot or the data table to estimate the number of centroids, or cluster centers (k).
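As a quick check of what scaling does, the sketch below applies sklearn's scale() to a small made-up array and prints the column means and standard deviations. The numbers in raw are invented for illustration only.

```python
import numpy as np
from sklearn.preprocessing import scale

# Made-up measurements on very different numeric ranges.
raw = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0],
                [4.0, 7000.0]])

# Subtract each column's mean and divide by its standard deviation.
scaled = scale(raw)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns contribute on the same footing to the distances that K Means computes.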

Step 1: Import the necessary libraries required for the K Means Clustering model

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pylab import rcParams
#sklearn 
import sklearn
from sklearn.cluster import KMeans 
from sklearn.preprocessing import scale # for scaling the data
import sklearn.metrics as sm # for evaluating the model
from sklearn import datasets
from sklearn.metrics import confusion_matrix,classification_report

Step 2: Define the Parameters for the Visualization

%matplotlib inline 
rcParams["figure.figsize"] =20,10

I am using a Jupyter notebook. Therefore, to show the figures inline, I am calling the statement %matplotlib inline.

Step 3: Load and scale the Dataset.

I am loading the default sklearn Iris dataset. You can also use your own dataset. But for the demonstration, I am using the default dataset.

iris = datasets.load_iris()
#scale the data
data = scale(iris.data) # scale the iris data
target = pd.DataFrame(iris.target) # define the target 
variable_names = iris.feature_names
data[0:10]

Here, data is the scaled feature data and target is the species of each flower.

Please note that data[0:10] returns a NumPy array, not a DataFrame.

Step 4: Build the Cluster Model and model the output

In this step, you will build the K Means cluster model and call the fit() method on the dataset. After that, you will model the output for the data visualization.

clustering = KMeans(n_clusters=3,random_state=5)
#fit the dataset
clustering.fit(data)

iris_df = pd.DataFrame(iris.data)
iris_df.columns = ["sepal_length","sepal_width","petal_length","petal_width" ]
target.columns =["Target"]

The output above shows that the KMeans() estimator has been created and fitted. You can see the various arguments defined inside it, such as the type of the algorithm and the number of clusters (n_clusters). You can read about them in the scikit-learn K-Means clustering documentation.
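If you want to look at those arguments and at what the model learned, the following sketch prints them from a fitted estimator. It repeats the loading and fitting so that it runs on its own. Passing n_init=10 explicitly is my addition, not part of the code above, to keep the behaviour the same across sklearn versions.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

data = scale(datasets.load_iris().data)

# n_init=10 is set explicitly here (an addition to the tutorial's call).
clustering = KMeans(n_clusters=3, random_state=5, n_init=10)
clustering.fit(data)

print(clustering.get_params())      # every constructor argument, including defaults
print(clustering.cluster_centers_)  # one row per centroid, in scaled feature space
print(clustering.labels_[:10])      # cluster index assigned to the first ten flowers
print(clustering.inertia_)          # sum of squared distances to the nearest centroid
```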

Step 5: Plot the Model Output using Matplotlib

colors = np.array(["Red","Green","Blue"])
plt.subplot(1,2,1)
plt.scatter(x=iris_df["petal_length"] ,y= iris_df["petal_width"],c = colors[iris.target],s=50)
plt.title("Before K Means Classification")

plt.subplot(1,2,2)
plt.scatter(x=iris_df["petal_length"] ,y= iris_df["petal_width"],c = colors[clustering.labels_],s=50)
plt.title("K Means Classification")

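Both figures suggest that the model has found the clusters reasonably accurately. The only difference you see is that the clusters are labelled differently, because K Means numbers its clusters arbitrarily. To reassign the labels, we use the np.choose() method. Here you change the label order from [0,1,2] to [2,0,1]. This mapping matches the run shown here, so check the plot and adjust it if your cluster numbers come out in a different order. The full code is given below.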

relabel = np.choose(clustering.labels_,[2,0,1]).astype(np.int64)
colors = np.array(["Red","Green","Blue"])
plt.subplot(1,2,1)
plt.scatter(x=iris_df.petal_length ,y= iris_df.petal_width,c = colors[iris.target],s=50)
plt.title("Before K Means Classification")

plt.subplot(1,2,2)
plt.scatter(x=iris_df.petal_length ,y= iris_df.petal_width,c = colors[relabel],s=50)
plt.title("K Means Classification")
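Because the cluster numbers can differ between runs and sklearn versions, a hard-coded order such as [2,0,1] may not always fit. As an alternative sketch, you can derive the mapping from the data by giving each cluster the species that occurs most often inside it. This majority-vote approach is my assumption, not part of the original code, and it can map two clusters to the same species if the clustering is poor.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

iris = datasets.load_iris()
clustering = KMeans(n_clusters=3, random_state=5, n_init=10).fit(scale(iris.data))

# For each cluster id, find the species that occurs most often among its members.
mapping = np.array([
    np.bincount(iris.target[clustering.labels_ == cluster], minlength=3).argmax()
    for cluster in range(3)
])

# Same operation as np.choose(clustering.labels_, mapping).
relabel = mapping[clustering.labels_]

print(mapping)
print((relabel == iris.target).mean())  # fraction of flowers matching their species
```

You can then pass relabel to the plotting code in the same way as before.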

Step 6: Evaluate the Accuracy of the Cluster Results

In the last step, you will verify the accuracy of the cluster results. You can also use the Elbow method to validate the number of clusters chosen for the model.
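A minimal sketch of the Elbow method follows, assuming you use the model's inertia_ (the sum of squared distances from each point to its nearest centroid) as the score. The range of k values tried here is an arbitrary choice for the demonstration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

data = scale(datasets.load_iris().data)

# Fit one model per candidate k and record its inertia.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, random_state=5, n_init=10).fit(data)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Inertia always falls as k grows, so you look for the k after which the drop flattens out (the "elbow"). Plotting the values with plt.plot(list(inertias), list(inertias.values())) makes the bend easier to see.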

Before verifying the results know the following term.
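Assuming the verification relies on the confusion_matrix and classification_report functions imported in Step 1, two terms help when reading the report. Precision is the share of flowers assigned to a species that really belong to it, and recall is the share of flowers of a species that the model assigned to it. The sketch below is one possible way to produce the report. The majority-vote mapping is my assumption, used so that the cluster numbers line up with the species numbers.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import scale

iris = datasets.load_iris()
clustering = KMeans(n_clusters=3, random_state=5, n_init=10).fit(scale(iris.data))

# Map each cluster id to its most common species so the ids are comparable to iris.target.
mapping = np.array([
    np.bincount(iris.target[clustering.labels_ == c], minlength=3).argmax()
    for c in range(3)
])
relabel = mapping[clustering.labels_]

# Rows are the true species, columns are the relabelled clusters.
cm = confusion_matrix(iris.target, relabel)
print(cm)

# Precision, recall and F1 score per species.
print(classification_report(iris.target, relabel,
                            target_names=iris.target_names, zero_division=0))
```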

Conclusion

The K Means clustering model is a popular way of clustering datasets that are unlabelled. In the real world, you will get large datasets that are mostly unstructured, and you will use machine learning algorithms to give them structure. There are also other types of clustering methods. The clustering algorithm you choose will depend on the dataset.

I hope this tutorial has made the K Means Clustering algorithm easy to understand. If you need any help from our side, you can message us directly on the Data Science Learn Page. We are always ready to help you.

Thanks

Data Science Learner Team

Source:

K Means Clustering Documentation