Step by Step Explanation of the KNN algorithm

Installing Packages

To implement the KNN algorithm in R, we need to install a few packages: class, ggplot2, caret and GGally.

Installing packages in RStudio

We can install packages in RStudio in two ways:

  • From the menu: open RStudio, click Tools, then click Install Packages. In the dialog that opens, type the required package name and click Install. In short: open RStudio → Tools → Install Packages → enter the package name → Install.
  • From the console: type install.packages("package_name") in the RStudio console, replacing package_name with the name of the package. Several packages can be installed in one call by passing a character vector, for example install.packages(c("class", "caret", "ggplot2", "GGally")).

Importing Packages

To work with the KNN algorithm we need to load the installed packages into our script with the function library(). The purpose of each package is described in the list below, and the code after the list loads class, caret, ggplot2 and GGally.

  • class – An R package for classification, including k-nearest-neighbour methods. It provides functions such as knn(), knn.cv() and reduce.nn(). In this article we load it for the function knn().
  • caret – An R package for training and evaluating models for both classification and regression problems. Here it provides the function confusionMatrix().
  • ggplot2 – An R package for creating graphics. It is used here for data visualization.
  • GGally – An extension of ggplot2 that adds functions which reduce the complexity of building some common plots.
  • library() – An R function that loads an installed package into the current session. Each call loads one package, so we call library() once per package, with the syntax library(package_name).

R




library(class)
library(caret)
library(ggplot2)
library(GGally)


Accessing/Importing Dataset

After loading the required packages, we need to load the data into the R script. There are two ways to do this.

  • We can load a built-in dataset with the function data(). R ships with about 100 built-in datasets in its datasets package (the exact count depends on the R version). The code below loads the built-in dataset iris, which has 150 rows and 5 columns.

R




data(iris)
head(iris, 5)  # show the first five rows


Output:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa


The iris data is now available in our R script.

  • We can also load data with the function read.csv(), which reads a CSV file and returns its contents as a data frame. Datasets can be downloaded from sources such as kaggle.com, or we can create our own data.
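
As a minimal sketch of the read.csv() route, the code below first writes the built-in iris data to a temporary CSV file and then reads it back. The temporary file and the name iris_csv are only illustrative assumptions; with a downloaded dataset, csv_path would simply be the path to that file.

```r
# Hypothetical file: write iris to a temporary CSV so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Read the CSV back into a data frame; stringsAsFactors = TRUE turns
# the text column Species into a factor, as in the built-in dataset
iris_csv <- read.csv(csv_path, stringsAsFactors = TRUE)

str(iris_csv)   # structure: 150 observations of 5 variables
head(iris_csv)  # first rows of the loaded data
```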

Normalization

KNN is a distance-based algorithm, so variables measured on larger scales can dominate the distance calculation. To bring all variables to the same scale we can use normalization or standardization. Here we use min-max normalization, which rescales each variable to the range 0 to 1. Rescaling matters most when the variables differ widely in range; it is not necessary in every case.

R




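# Min-max normalization: rescale a numeric vector to the range [0, 1]
normal_frame <- function(a) {
  (a - min(a)) / (max(a) - min(a))
}
# Normalize the four numeric columns (column 5, Species, is excluded)
iris_new_frame <- as.data.frame(lapply(iris[, -5], normal_frame))
summary(iris_new_frame)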


Output:

  Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
 Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
 Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
 3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  


We can see that after normalization every variable ranges from 0 to 1.
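
The text above also mentions standardization as an alternative. A minimal sketch of it, assuming base R's scale() with its default settings (centre each column on its mean and divide by its standard deviation), could look like this. The name iris_standardized is only illustrative and is not used elsewhere in the article.

```r
# Standardize the four numeric iris columns (z-scores):
# each column ends up with mean 0 and standard deviation 1
iris_standardized <- as.data.frame(scale(iris[, -5]))

# Check the result: column means are numerically zero,
# and column standard deviations are one
round(colMeans(iris_standardized), 10)
sapply(iris_standardized, sd)
```

Unlike min-max normalization, standardized values are not confined to the range 0 to 1.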

Creating test and training data

KNN is a supervised learning algorithm: it learns from previously available labelled data, so we need training data to predict from and separate test data to evaluate the predictions. We now divide the available data, using 70% as training data and the remaining 30% as test data. The code creates two pairs of objects. The first pair (train_iris and test_iris) holds the normalized features without the class column (the Species column). The second pair (train_iris_ran and test_iris_ran) holds only the class labels, taken from the Species column, for the same rows.

R




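set.seed(1234)  # make the random split reproducible
# Randomly pick 70% of the row indices for training
data_ran <- sample(1:nrow(iris_new_frame), size = nrow(iris_new_frame) * 0.7, replace = FALSE)
# Features (normalized, without Species)
train_iris <- iris_new_frame[data_ran, ]
test_iris <- iris_new_frame[-data_ran, ]

# Class labels (Species, column 5) for the same rows
train_iris_ran <- iris[data_ran, 5]
test_iris_ran <- iris[-data_ran, 5]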
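
The split above samples rows purely at random, so the three species may not be equally represented in the training data. As an alternative sketch (not the approach used in this article), caret's createDataPartition() samples within each class. The names train_rows, train_set and test_set are illustrative assumptions.

```r
library(caret)

set.seed(1234)  # make the split reproducible
# createDataPartition() samples within each level of the outcome,
# so class proportions are preserved in the training set
train_rows <- createDataPartition(iris$Species, p = 0.7, list = FALSE)

train_set <- iris[train_rows, ]
test_set <- iris[-train_rows, ]

table(train_set$Species)  # class counts in the training set
table(test_set$Species)   # class counts in the test set
```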


Creating the Model

We create the KNN model with the function knn() from the class package. Because KNN is a lazy learner, knn() does not return a fitted model object; it classifies the test data directly and returns the predicted classes as a factor. We pass it the training data (train), the test data (test), the class labels of the training data (cl; here the Species values from the fifth column) and the number of neighbours (k).

R




knnModel<-knn(train=train_iris,test=test_iris,cl=train_iris_ran,k=13)
summary(knnModel)


Output:

    setosa versicolor  virginica 
        16         16         13 
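
To make the prediction step more concrete, here is a rough from-scratch sketch of what a KNN classifier does for a single query point: compute the distance to every training row, take the k closest rows, and return the most common class among them. It assumes Euclidean distance and a simple majority vote; the helper knn_predict_one() is hypothetical and is not how knn() is implemented internally (for example, knn() breaks ties at random, while this sketch takes the first class with the most votes).

```r
# Hypothetical helper: classify one query point by majority vote
# among its k nearest training rows (Euclidean distance)
knn_predict_one <- function(train_x, train_y, query, k) {
  # Subtract the query from every training row, column by column
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(query))
  dists <- sqrt(rowSums(diffs^2))

  # Indices of the k smallest distances
  nearest <- order(dists)[seq_len(k)]

  # Count the classes among the neighbours and return the most frequent one
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Usage: classify a flower with the measurements of the first iris row
predicted <- knn_predict_one(iris[, 1:4], iris$Species,
                             query = c(5.1, 3.5, 1.4, 0.2), k = 5)
predicted
```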


Performance of model

We evaluate the performance of the model by calculating its accuracy, that is, the percentage of test observations whose species is predicted correctly from the sepal length, sepal width, petal length and petal width. The code below calculates the accuracy of the model.

R




accuracy<-100*sum(test_iris_ran==knnModel)/NROW(test_iris_ran)
accuracy


Output:

[1] 95.55556


We can also examine the performance of the model in more detail by creating its confusion matrix. In R we can create the confusion matrix with the function confusionMatrix(). This function is available only when the caret package is installed and loaded.

R




table(knnModel,test_iris_ran)
confusionMatrix(table(knnModel,test_iris_ran))


Output:

            test_iris_ran
knnModel     setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         15         1
  virginica       0          1        12
> confusionMatrix(table(knnModel,test_iris_ran))
Confusion Matrix and Statistics

            test_iris_ran
knnModel     setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         15         1
  virginica       0          1        12

Overall Statistics

               Accuracy : 0.9556
                 95% CI : (0.8485, 0.9946)
    No Information Rate : 0.3556
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.933

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9375           0.9231
Specificity                 1.0000            0.9655           0.9688
Pos Pred Value              1.0000            0.9375           0.9231
Neg Pred Value              1.0000            0.9655           0.9688
Prevalence                  0.3556            0.3556           0.2889
Detection Rate              0.3556            0.3333           0.2667
Detection Prevalence        0.3556            0.3556           0.2889
Balanced Accuracy           1.0000            0.9515           0.9459
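
To see where these numbers come from, the sketch below rebuilds the confusion matrix from the output above by hand and recomputes the accuracy and the per-class sensitivity. The object names cm, species, accuracy_cm and sensitivity_cm are illustrative only.

```r
# Confusion matrix copied from the output above:
# rows are predicted classes, columns are true classes
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(16,  0,  0,
                0, 15,  1,
                0,  1, 12),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = species, actual = species))

# Accuracy: correctly classified observations (the diagonal) over all observations
accuracy_cm <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over the true count of each class
sensitivity_cm <- diag(cm) / colSums(cm)

round(accuracy_cm, 4)     # 0.9556, i.e. 43 of 45 test rows
round(sensitivity_cm, 4)  # setosa 1.0000, versicolor 0.9375, virginica 0.9231
```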


Visualization

R




ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Width)) +
  geom_point(aes(color = factor(Species)))


Output:

[Figure: scatter plot of Sepal.Length (x-axis) against Petal.Width (y-axis), with points colored by Species]
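
The GGally package is loaded earlier but not used in the plot above. As one possible use, and only as a sketch, its ggpairs() function draws a scatter-plot matrix of all pairs of measurements, which helps show which variables separate the species. The name pair_plot is illustrative.

```r
library(ggplot2)
library(GGally)

# Scatter-plot matrix of the four measurements, colored by species.
# columns = 1:4 restricts the panels to the numeric variables.
pair_plot <- ggpairs(iris, columns = 1:4, mapping = aes(color = Species))
pair_plot
```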

Applications of KNN Algorithm

  1. The KNN algorithm is used for classifying images in image recognition.
  2. It can be used in text categorization tasks.
  3. It is useful for detecting spam messages and spam emails.
  4. It can be used for stock prediction, house price prediction, weather prediction and market segmentation.
  5. It can be used to identify fraudulent activity in financial transactions.
  6. It can be used to detect unusual network traffic patterns.
  7. It can be used in drug discovery and disease diagnosis.
  8. It is helpful in recognizing handwriting and faces.
  9. It is useful in robotics, for example in robot navigation and motion planning.

Advantages of KNN Algorithm

  1. The KNN algorithm is simple to understand.
  2. It is easy to implement.
  3. It is a lazy learning algorithm: it has no separate training phase and simply stores the training data.
  4. Because the work is done at prediction time, new data can be added without retraining, which makes it suitable for dynamic and changing datasets.
  5. It is versatile: it can be used for both classification and regression problems.
  6. It can handle both qualitative and quantitative data (i.e., categorical and numerical data), provided a suitable distance measure is chosen.
  7. With a sufficiently large value of k, the majority vote makes predictions less sensitive to individual outliers.
  8. It can capture complex patterns because it relies on the local structure of the data.

Disadvantages of KNN Algorithm

  1. Prediction is computationally expensive, because the distance from each new point to every training point must be calculated.
  2. It requires more memory, because the whole training dataset must be stored.
  3. The performance of the algorithm decreases as the number of dimensions of the dataset increases.
  4. The performance of the algorithm also depends on the value of k. A small value of k makes the predictions sensitive to noise, while a large value of k smooths over local structure and reduces sensitivity.
  5. With a small value of k or unscaled features, the algorithm is sensitive to noisy data, outliers and irrelevant features.
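
Since performance depends on the value of k (point 4 above), one common way to choose it is to compare the accuracy for several candidate values. The sketch below does this with leave-one-out cross-validation using knn.cv() from the class package, on min-max normalized iris features. The grid of odd values from 1 to 25 and the names features, labels, loo_accuracy and best_k are arbitrary illustrative choices, not part of the original walkthrough.

```r
library(class)

# Min-max normalize the four numeric features
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
features <- as.data.frame(lapply(iris[, -5], normalize))
labels <- iris$Species

set.seed(1234)  # knn.cv() breaks ties at random
k_values <- seq(1, 25, by = 2)  # illustrative grid of odd k values

# Leave-one-out accuracy for each candidate k:
# every row is classified using all the other rows as training data
loo_accuracy <- sapply(k_values, function(k) {
  predicted <- knn.cv(train = features, cl = labels, k = k)
  mean(predicted == labels)
})

data.frame(k = k_values, accuracy = round(loo_accuracy, 4))

# First k with the highest leave-one-out accuracy
best_k <- k_values[which.max(loo_accuracy)]
best_k
```

On a dataset as small as iris, several values of k typically give very similar accuracy, so the selected value should be read as a rough guide rather than a definitive optimum.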

Conclusion

In this article we learned about the KNN algorithm and the steps to implement it in the R programming language: installing and loading packages, loading and normalizing the data, splitting it into training and test sets, building the model and evaluating it. We also looked at the applications, advantages and disadvantages of the KNN algorithm.


