KMeans Clustering with Iris Dataset
K-means clustering is an unsupervised machine learning algorithm that groups data points into K clusters. It works as follows:
- First, choose the number of clusters, K.
- Randomly select K points from the dataset as the initial centroids.
- Assign each point to its closest centroid.
- Recompute each centroid as the mean of the points assigned to it.
- Repeat steps 3 and 4 until the centroids converge.
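The steps above can be sketched directly in NumPy. This is only an illustrative version: the function name `kmeans_sketch`, its parameters, and the toy data are made up for this example. It uses plain random initialization rather than the `k-means++` initialization used in the scikit-learn code in this article.

```python
import numpy as np


def kmeans_sketch(points, k, max_iter=300, seed=0):
    """Minimal K-means (Lloyd's algorithm) following the listed steps."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy data: two well-separated groups of 2-D points (made up for illustration).
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

The first three toy points should end up in one cluster and the last three in the other. Which cluster is numbered 0 or 1 depends on the random starting centroids.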
Python3
from sklearn.cluster import KMeans

# Compute the within-cluster sum of squares (WCSS) for k = 1 to 10
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300,
                    n_init=10, random_state=0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

# The elbow method applied to the values in wcss gives
# the number of clusters to use
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300,
                n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(x)
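The code above records the within-cluster sum of squares (WCSS, exposed by scikit-learn as inertia_) for each k from 1 to 10. Plotting these values against k and applying the elbow method gives an optimal value of k = 3, which is the number of clusters used for the final model.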
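The elbow plot itself is not shown in the code above. A minimal sketch of how it could be drawn follows. It assumes the data comes from scikit-learn's bundled copy of Iris via `load_iris`; the variable names `features` and `k_values` are introduced only for this example.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: the four numeric Iris features from scikit-learn's bundled copy.
features = load_iris().data

k_values = range(1, 11)
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, init="k-means++", max_iter=300,
                   n_init=10, random_state=0)
    model.fit(features)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this k

# Plot WCSS against k; the "elbow" is where the curve stops dropping sharply.
plt.plot(k_values, wcss, marker="o")
plt.title("Elbow method")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (inertia)")
plt.show()
```

Reading the elbow is a visual judgment: WCSS keeps decreasing as k grows, so the aim is to pick the point after which extra clusters bring only small improvements.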
Visualizing the Clusters
Python3
import matplotlib.pyplot as plt

# Visualising the clusters on the first two feature columns.
# Cluster numbers are arbitrary, so verify the species labels
# against the cross-tabulation.
cols = iris.columns
plt.scatter(X.loc[y_kmeans == 0, cols[0]], X.loc[y_kmeans == 0, cols[1]],
            s=100, c='purple', label='Iris-setosa')
plt.scatter(X.loc[y_kmeans == 1, cols[0]], X.loc[y_kmeans == 1, cols[1]],
            s=100, c='orange', label='Iris-versicolour')
plt.scatter(X.loc[y_kmeans == 2, cols[0]], X.loc[y_kmeans == 2, cols[1]],
            s=100, c='green', label='Iris-virginica')

# Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=100, c='red', label='Centroids')
plt.legend()
plt.show()
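Output: a scatter plot of the three clusters on the first two feature columns, with the centroids marked in red.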
Accuracy and Performance of Model
Now let’s check the performance of the model.
Python3
# Rows: true species labels; columns: cluster assigned by K-means
pd.crosstab(iris.target, y_kmeans)
Output:
Because K-means is an unsupervised algorithm, there is no test set on which to check the model's performance. Instead, we compare the cluster assignments with the known species labels. The Setosa class is clustered perfectly, and Versicolor has only 2 misclassifications. The Virginica class overlaps with Versicolor, so it has 14 misclassifications.
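One way to turn the cross-tabulation into a single number is to map each cluster to its majority species and count the points that agree. The sketch below is one possible approach, not the article's method. It assumes scikit-learn's bundled Iris data, and the names `table`, `cluster_to_species`, `matched`, and `agreement` are introduced only for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris_data = load_iris()
features, species = iris_data.data, iris_data.target

clusters = KMeans(n_clusters=3, init="k-means++", max_iter=300,
                  n_init=10, random_state=0).fit_predict(features)

# Rows: true species, columns: cluster index (cluster numbering is arbitrary).
table = pd.crosstab(species, clusters, rownames=["species"], colnames=["cluster"])

# Map each cluster to the species that dominates it, then count agreements.
cluster_to_species = table.idxmax(axis=0)
matched = int(sum(table.loc[cluster_to_species[c], c] for c in table.columns))
agreement = matched / len(species)

print(table)
print(f"Points matching their cluster's majority species: {matched}/{len(species)}")
```

This majority-vote agreement is sometimes called cluster purity. It uses the true labels only for evaluation, never for fitting the clusters.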
Analyzing Decision Tree and K-means Clustering using Iris dataset
The Iris dataset is one of the best-known datasets in the pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.
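The separability claim can be checked on a single feature. The sketch below assumes scikit-learn's bundled copy of the dataset, where column 2 is petal length and target 0 is Setosa; the variable names are chosen for this example.

```python
from sklearn.datasets import load_iris

iris_data = load_iris()
petal_length = iris_data.data[:, 2]  # column 2 is petal length (cm)
is_setosa = iris_data.target == 0    # target 0 is setosa in scikit-learn's copy

setosa_max = petal_length[is_setosa].max()
others_min = petal_length[~is_setosa].min()

# If the largest setosa value is below the smallest value of the other two
# species, a single threshold on this one feature separates setosa.
print(setosa_max, others_min, setosa_max < others_min)
```

A threshold on one feature is the simplest kind of linear boundary, so a gap here is enough to show that Setosa can be separated from the other two species.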
Attribute Information:
- Sepal Length in cm
- Sepal Width in cm
- Petal Length in cm
- Petal Width in cm
- Class:
  - Iris Setosa
  - Iris Versicolour
  - Iris Virginica
Let’s perform exploratory data analysis (EDA) on the dataset as an initial investigation.
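A first pass at EDA might look like the sketch below. It assumes the data is loaded as a pandas DataFrame through scikit-learn's `load_iris(as_frame=True)`, which may differ from how the rest of this article loads it; `iris_df` is a name chosen for this example.

```python
from sklearn.datasets import load_iris

# Assumption: load the data as a pandas DataFrame via scikit-learn.
iris_df = load_iris(as_frame=True).frame

print(iris_df.shape)                     # number of rows and columns
print(iris_df.head())                    # first few records
print(iris_df.isnull().sum())            # missing values per column
print(iris_df.describe())                # summary statistics per feature
print(iris_df["target"].value_counts())  # instances per class
```

These checks confirm the basics stated above (three classes of 50 instances each, four measurements in cm) before any modeling is done.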