Sometimes you have a dataset that is mostly unlabeled. The problem starts when you want to structure that dataset and make it valuable by labeling it. Machine learning offers various methods for grouping unlabeled data, and clustering is one of them. In this “**How to**” tutorial, you will learn to do **K Means Clustering in Python**.

## What is the K Means Clustering Algorithm?

K Means is a simple unsupervised clustering algorithm used to predict groups in an unlabeled dataset. Its predictions depend on two values:

**1. The number of cluster centers (centroids, k)**

**2. The nearest mean value among the observations**
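To make these two ingredients concrete, here is a minimal sketch of the idea in plain NumPy. It is only an illustration, not the implementation sklearn uses: the function name `kmeans_sketch`, the toy points, and the random starting centroids are all assumptions made for this example.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Return, for every point, the index of its closest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Toy K Means: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(points, centroids)
        # Move each centroid to the mean of its points (keep it if the cluster is empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return nearest_centroid(points, centroids), centroids

# Two well-separated groups of made-up points.
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

The first three points end up in one cluster and the last three in the other. Which cluster gets the id 0 and which gets 1 is arbitrary, and that detail becomes important later when the clusters are compared with the true species.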

K Means Clustering has many popular use cases, including price and cost modeling for a specific market, fraud detection, and portfolio or hedge fund management.

Before going into the details and the coding part of K Means Clustering in Python, keep in mind that clustering should be done on scaled (standardized) variables. That means each variable should have a mean of zero and a standard deviation of one. This matters because K Means relies on distances, so a variable with a large range would otherwise dominate the result. The other thing to remember is to use a scatter plot or the data table to estimate the number of centroids, or cluster **centers (k)**.
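As a quick check of what scaling means, the sketch below standardizes a small made-up array by hand. The numbers are invented for illustration; the `scale` function from sklearn used later in this tutorial performs the same per-column operation.

```python
import numpy as np

# Made-up measurements: two features on very different scales.
raw = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

# Standardize column by column: subtract the mean, divide by the standard deviation.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # close to 1 for every column
```

After this step, both columns contribute on the same footing when distances to the centroids are computed.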

## Step 1: Import the necessary Libraries required for the K means Clustering model

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pylab import rcParams
#sklearn
import sklearn
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale # for scaling the data
import sklearn.metrics as sm # for evaluating the model
from sklearn import datasets
from sklearn.metrics import confusion_matrix,classification_report
```

## Step 2: Define the Parameters for the Visualization

```
%matplotlib inline
rcParams["figure.figsize"] =20,10
```

I am using a Jupyter notebook, so to show the figures inline I call the **%matplotlib inline** statement.

## Step 3: Load and scale the Dataset.

I am loading the Iris dataset that ships with sklearn. You can also use your own dataset, but for this demonstration I am using the built-in one.

```
iris = datasets.load_iris()
#scale the data
data = scale(iris.data) # scale the iris data
target = pd.DataFrame(iris.target) # define the target
variable_names = iris.feature_names
data[0:10]
```

Here, data is the scaled feature matrix and target is the **species** of each observation.

Please note that data[0:10] returns a plain NumPy array (the first ten scaled rows), not a DataFrame.

## Step 4: Build the Cluster Model and model the output

In this step, you will build the K means cluster model and call the fit() method on the dataset. After that, you will model the output for the data visualization.

```
clustering = KMeans(n_clusters=3,random_state=5)
#fit the dataset
clustering.fit(data)
iris_df = pd.DataFrame(iris.data)
iris_df.columns = ["sepal_length","sepal_width","petal_length","petal_width" ]
target.columns =["Target"]
```

The output above shows that the KMeans() cluster method has been called. You can see the various arguments defined inside it, such as the type of algorithm and the number of clusters (n_clusters). You can read more about them here: K-Means clustering.
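Of these arguments, n_clusters is the one you have to choose yourself. Besides looking at a scatter plot, one common sanity check is to fit the model for several values of k and compare the within-cluster sum of squares, which sklearn exposes as `inertia_`. The sketch below is one way to do that; the range of k values and the explicit `n_init=10` are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

data = scale(datasets.load_iris().data)

# Fit one model per candidate k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=5).fit(data)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The inertia keeps shrinking as k grows, so you are not looking for the smallest value. You are looking for the point where adding another cluster stops giving a large drop.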

## Step 5: Plot the Model Output using Matplotlib

```
colors = np.array(["Red","Green","Blue"])
plt.subplot(1,2,1)
plt.scatter(x=iris_df["petal_length"] ,y= iris_df["petal_width"],c = colors[iris.target],s=50)
plt.title("Before K Means Classification")
plt.subplot(1,2,2)
plt.scatter(x=iris_df["petal_length"] ,y= iris_df["petal_width"],c = colors[clustering.labels_],s=50)
plt.title("K Means Classification")
```

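Both figures suggest that the model has found the clusters quite accurately. The only problem you see is that the clusters are mislabelled: K Means numbers its clusters arbitrarily, so the cluster ids do not line up with the species ids. To reassign the labels, we use the np.choose() method. Here you change the label positions from [0,1,2] to [2,0,1]; the right permutation depends on the run, so check it against your own plot. The full code is given below.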

```
relabel = np.choose(clustering.labels_,[2,0,1]).astype(np.int64)
colors = np.array(["Red","Green","Blue"])
plt.subplot(1,2,1)
plt.scatter(x=iris_df.petal_length ,y= iris_df.petal_width,c = colors[iris.target],s=50)
plt.title("Before K Means Classification")
plt.subplot(1,2,2)
plt.scatter(x=iris_df.petal_length ,y= iris_df.petal_width,c = colors[relabel],s=50)
plt.title("K Means Classification")
```
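If it is not obvious what np.choose() does with a list like [2,0,1], the small sketch below may help. The two label arrays are made up for illustration, and deriving the mapping by majority vote is an assumption of this example rather than something the tutorial code does.

```python
import numpy as np

# Made-up example: true classes and the arbitrary ids a clustering run assigned.
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_labels = np.array([1, 1, 1, 2, 2, 0, 0, 0, 0])

# For each cluster id, pick the true class that occurs most often inside it.
mapping = [int(np.bincount(true_labels[cluster_labels == c]).argmax())
           for c in range(3)]
print(mapping)   # [2, 0, 1]

# np.choose replaces each cluster id c with mapping[c].
relabel = np.choose(cluster_labels, mapping)
print(relabel)   # [0 0 0 1 1 2 2 2 2]
```

A majority vote like this can map two clusters to the same class when the clusters are poorly separated, so it is worth printing the mapping before you rely on it.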

## Step 6: Evaluate the Accuracy of the Cluster Results

In the last step, you will verify the accuracy of the model's results. To do so, you use the sklearn classification report.

`print(classification_report(target,relabel))`

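Before verifying the results, get to know the following terms.

**Precision**: It measures the relevancy of the model. Of the points assigned to a class, it is the fraction that truly belong to that class.

**Recall**: It measures the completeness of the model. Of the points that truly belong to a class, it is the fraction that were assigned to that class.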

**Highly Accurate Model Results = High Precision + High Recall**
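To see how these two numbers come about, here is a small hand calculation for a single class. The labels are made up for illustration; the classification report computes the same quantities for every class.

```python
import numpy as np

# Made-up labels: 1 marks the class of interest.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])

true_pos = np.sum((y_pred == 1) & (y_true == 1))   # correctly assigned to the class
false_pos = np.sum((y_pred == 1) & (y_true == 0))  # assigned to the class by mistake
false_neg = np.sum((y_pred == 0) & (y_true == 1))  # members of the class that were missed

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
print(precision, recall)   # 0.75 0.75
```

Here three of the four points assigned to the class are correct (precision 0.75), and three of the four true members were found (recall 0.75).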

In our case, the average precision is 83% and the average recall is 83% across the entire dataset. From these results, you can say that our model is giving highly accurate results.
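If you want to see where the misassigned points fall, a confusion matrix is a useful companion to the report. The sketch below is self-contained and rebuilds the model from scratch; the explicit `n_init=10` and the majority-vote relabelling are assumptions of this example, so the exact counts may differ from a run that uses the hard-coded [2,0,1] mapping.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import scale

iris = datasets.load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=5).fit(scale(iris.data)).labels_

# Map each cluster id to the species that dominates it (majority vote),
# so the sketch does not depend on a hard-coded permutation.
mapping = [int(np.bincount(iris.target[labels == c]).argmax()) for c in range(3)]
relabel = np.choose(labels, mapping)

# Rows are the true species, columns are the relabelled clusters.
cm = confusion_matrix(iris.target, relabel)
print(cm)
print("fraction on the diagonal:", cm.trace() / cm.sum())
```

Off-diagonal entries show which species get confused with each other, which the averaged precision and recall figures hide.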

## Conclusion

The K means clustering model is a popular way of clustering unlabelled datasets. In the real world, you will get large datasets that are mostly unstructured, and you will use machine learning algorithms to turn them into structured datasets. There are also other types of clustering methods, and the one you choose will depend on the dataset.

I hope you now have a good understanding of the **K Means Clustering** algorithm. If you need any help from our side, you can message us directly on the Data Science Learner page. We are always ready to help you.

**Thanks**

**Data Science Learner Team**
