K MEANS CLUSTERING IN PYTHON

K Means Clustering in Python : Label the Unlabeled Data

There are some cases when you have a dataset that is mostly unlabeled. The problems start when you want to structure the datasets and make it valuable by labeling it. In machine learning, there are various methods for labeling these datasets. Clustering is one of them. In this tutorial of “How to“, you will learn to do K Means Clustering in Python.

What is K Means Clustering Algorithm?

It is a clustering algorithm that is a simple Unsupervised algorithm used to predict groups from an unlabeled dataset. In the K Means clustering predictions are dependent or based on the two values.

1.The number of cluster centers ( Centroid k)

2. Nearest Mean value between the observations.

There are many popular use cases of the K Means Clustering and some of them are Price and cost Modeling of a Specific Market, Fraud Detection, Portfolio or Hedge Fund Mangement.

Before going in details and coding part of the K Mean Clustering in Python, you should keep in mind that Clustering always done on Scaled Variable (Normalized). It means the Mean should be zero and the sum of the covariance should be equal to one. And the other things to remember is the use of scatter plot or the data table for taking the estimated number of the centroids or the cluster centers (k).

Step 1: Import the necessary Library required for K means Clustering model

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pylab import rcParams
#sklearn 
import sklearn
from sklearn.cluster import KMeans 
from sklearn.preprocessing import scale # for scaling the data
import sklearn.metrics as sm # for evaluating the model
from sklearn import datasets
from sklearn.metrics import confusion_matrix,classification_report

Step 2: Define the Parameters for the Visualization

%matplotlib inline 
rcParams["figure.figsize"] =20,10

I am using the Jupyter notebook there for showing the figure inline, I am calling the statement %matplotlib inline.

Step 3: Load and scale the Dataset.

I am loading the default sklearn Iris dataset. You can also use your own dataset. But for the demonstration, I am using the default dataset.

iris = datasets.load_iris()
#scale the data
data = scale(iris.data) # scale the iris data
target = pd.DataFrame(iris.target) # define the target 
variable_names = iris.feature_names
data[0:10]

Here the data is the scaled data and the target is the species of the data.

iris data

Please note that the data[0:10] will return the np array only.

Step 4: Build the Cluster Model and model the output

In this step, you will build the K means cluster model and will call the fit() method for the dataset. After that, you will mode the output for the data visualization.

clustering = KMeans(n_clusters=3,random_state=5)
#fit the dataset
clustering.fit(data)

iris_df = pd.DataFrame(iris.data)
iris_df.columns = ["sepal_length","sepal_width","petal_length","petal_width" ]
target.columns =["Target"]

k mean clustering model

The above output defines the KMeans() cluster method has been called. You can see there are various arguments are defined inside the method. The type of the algorithm, the number of clusters (n_clusters). e.t.c.  You can know about it here. K-Means clustering

Step 5: Plot the Model Output using Matplotlib

colors = np.array(["Red","Green","Blue"])
plt.subplot(1,2,1)
plt.scatter(x=iris_df["petal_length"] ,y= iris_df["petal_width"],c = colors[iris.target],s=50)
plt.title("Before K Means Classificaion")

plt.subplot(1,2,2)
plt.scatter(x=iris_df["petal_length"] ,y= iris_df["petal_width"],c = colors[clustering.labels_],s=50)
plt.title("K means Classifcation")

k mean clustering model output

Both figures suggest that the model has accurately predicted clusters. The only things you are seeing is the clusters are mislabelled. To reassign the Label it uses we use the np.choose() method. To do so you change the label position from [0,1,2] to [2,0,1]. The full code is given below.

relabel = np.choose(clustering.labels_,[2,0,1]).astype(np.int64)
colors = np.array(["Red","Green","Blue"])
plt.subplot(1,2,1)
plt.scatter(x=iris_df.petal_length ,y= iris_df.petal_width,c = colors[iris.target],s=50)
plt.title("Before K Means Classificaion")

plt.subplot(1,2,2)
plt.scatter(x=iris_df.petal_length ,y= iris_df.petal_width,c = colors[relabel],s=50)
plt.title("K means Classifcation")

d

k mean clustering model after relabeling

Step 6: Evaluate the Accuracy of the Cluster Results

At the last step, you will verify the results for accuracy of the model. In order to do so, you use sklearn classification reports.

print(classification_report(target,relabel))

Before verifying the results know the following term.

Precision: It measures the relevancy of the model.

Recall: Measures the completeness of the model.

Highly Accurate Model Results = High Precision + High Recall

In our case, average Precision is 83% and the average Recall is 83% of the entire dataset. From these results, you can say our model is giving highly accurate results.

Conclusion

K means clustering model is a popular way of clustering the datasets that are unlabelled. But In the real world, you will get large datasets that are mostly unstructured. Thus to make it a structured dataset. You will use machine learning algorithms. There are also other types of clustering method. The type of the Clustering algorithms you will choose will completely depend upon the dataset.

I  think you must have easily understood the K Mean Clustering algorithm. In order to get any help from our side, you can directly message us on the Data Science Learn Page. We are always ready to help you.

Thanks 

Data Science Learner Team

 

Join our list

Subscribe to our mailing list and get interesting stuff and updates to your email inbox.

Thank you for signup. A Confirmation Email has been sent to your Email Address.

Something went wrong.

Share via
 
Thank you For sharing.We appreciate your support. Don't Forget to LIKE and FOLLOW our SITE to keep UPDATED with Data Science Learner