Introduction and a detailed explanation of the k Nearest Neighbors Algorithm

What is kNN?

k-Nearest Neighbors (kNN) is one of the simplest Machine Learning algorithms. Strictly speaking it is a “Classification” algorithm, but because its procedure is so generic it can also be used for feature selection, outlier detection (Wilson editing), and missing-value imputation. It is also called Instance-Based Learning or Lazy Learning because it does nothing at training time! The only hyper-parameter in kNN is “k”, the number of neighbors to consider.

Working of kNN

kNN has a simple working mechanism, which I will explain in 4 steps. When a test point comes in, this is what kNN does:

  1. Fix the value of k
  2. Find the k nearest neighbors using the Euclidean distance formula (or any other distance metric)
  3. Vote on the class labels of those neighbors
  4. Predict the class that wins the vote

Let me illustrate kNN with a simple example. Assume that our data set has 3 class labels (A, B, C) and that we fix the value of k as 3, i.e. we find the 3 nearest neighbors. Now when a test point comes in, we find its 3 nearest neighbors in our data set. Suppose the algorithm gives us neighbors with labels A, A, C. Since the test point must belong to exactly one class, we have to select one of A, A, C, so we introduce a voting mechanism: there are two A’s and one C, “A” wins the vote, and we assign the class label “A” to the test point. It is as simple as that!
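To make the voting step concrete, here is a tiny sketch of the vote itself (the A, A, C labels are just the hypothetical neighbors from the example above):

neighbor_labels = ['A', 'A', 'C']
prediction = max(set(neighbor_labels), key=neighbor_labels.count)   # majority vote
print(prediction)   # prints 'A', since A has 2 votes and C has 1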

Now, let us look at the detailed explanation with code.

Explanation of kNN Algorithm

0. Import the required libraries

from sklearn import datasets
import pandas as pd
import numpy as np
import seaborn as sns
import csv
import random
import math
import operator
%matplotlib inline

Let us work on the famous Iris dataset (look it up if you are not familiar with it). The first step is to load the data set and then split it into training data and test data.

1. Load the dataset and split the dataset

The Iris data set can be loaded in 2 ways: directly from the sklearn library, or from a CSV file on your local desktop.
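For reference, the first way takes only a couple of lines; a minimal sketch using sklearn’s load_iris (the datasets module is already imported above):

iris = datasets.load_iris()
X, y = iris.data, iris.target   # 150 rows, 4 feature columns, numeric class labels
print(X.shape, y.shape)         # (150, 4) (150,)

In this article, though, let’s follow the second approach and read the CSV ourselves: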

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

loadDataset is the function that loads our data set. We open the file in read mode and, since the dataset consists of comma-separated values, parse it with the csv reader. We then run a loop over the rows: in each iteration we convert the data types and place the row into the training or test set depending on the value returned by the random function.

By default, the values in the data set are read as strings. But the four feature values (sepal length, sepal width, petal length and petal width) should be of float type, so we convert them to floats.

Now we have to split our dataset into 2 parts: each row is randomly assigned to either the training set or the test set.
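As a side note, if you would rather not hand-roll the split, scikit-learn’s train_test_split does roughly the same job; a minimal sketch (assuming dataset is the list of rows read from the CSV):

from sklearn.model_selection import train_test_split

# 67% of the rows go to training, the rest to testing
trainingSet, testSet = train_test_split(dataset, train_size=0.67, random_state=1)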

Let’s now call our loadDataset function.

trainingSet=[]
testSet=[]
split = 0.67
loadDataset('iris.csv', split, trainingSet, testSet)
print('Train set is ', repr(len(trainingSet)))
print('Test set is ' , repr(len(testSet)))

After this is done, we can see that our data set has been successfully split into train and test. Moving on to the next step.

2. Defining distance and neighbors function

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

As I said earlier, in the kNN algorithm we measure the distance between 2 points using the Euclidean distance, and that is exactly what the “euclideanDistance” function does. If we had 3 dimensions, the formula would be √((x2−x1)² + (y2−y1)² + (z2−z1)²). But in our dataset we have four dimensions/columns (sepal length, sepal width, petal length and petal width), so we add one more squared-difference term to the formula and then return the distance. The “length” parameter is the number of dimensions in our dataset, i.e. 4.
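As a side note, since numpy is already imported, the same distance can be computed without an explicit loop; a small alternative sketch (not part of the original walkthrough):

def euclideanDistanceNumpy(instance1, instance2, length):
    # vectorized version of euclideanDistance over the first `length` values
    a = np.array(instance1[:length], dtype=float)
    b = np.array(instance2[:length], dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))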

The getNeighbors function, given a test point, the entire training data and the value of “k”, should return the k nearest neighbors of that test point in the training set. To achieve this, we loop over the entire training set and, in each iteration, compute the distance between the test point and the current training point. After calculating all the distances, we sort them in ascending order, keep only the first “k” entries, and return those neighbors.
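For example, with a hypothetical test row (the feature values and label below are made up purely for illustration), the call looks like this:

testInstance = [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']   # 4 features + class label
neighbors = getNeighbors(trainingSet, testInstance, 3)
print(neighbors)   # the 3 training rows closest to testInstance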

3. Voting

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

As in the earlier example, where we had 2 A’s and 1 C and chose A as the class label for the test point, this function does exactly that. After finding the k nearest neighbors, we vote on their class labels by populating the classVotes dictionary. Once the dictionary is populated, we sort it by vote count to see which label is the winner, and return the winner!
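For what it’s worth, the same vote can also be written more compactly with collections.Counter; a small alternative sketch, not part of the original walkthrough:

from collections import Counter

def getResponseCounter(neighbors):
    votes = Counter(neighbor[-1] for neighbor in neighbors)   # count each class label
    return votes.most_common(1)[0][0]                         # label with the most votes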

4. Finding the accuracy of our model

Creating a model is not enough; the model’s accuracy should be reasonably good. Accuracy here is the number of correct predictions divided by the total number of predictions made by our model, expressed as a percentage. For example, 45 correct predictions out of 50 test points gives 90% accuracy. Checking the accuracy is simple. It is as follows,

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

All our predictions are stored in the “predictions” list, and the true class labels sit in the last column of our test data. So we compare the predicted class labels with the original class labels and finally return the accuracy.
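A quick sanity check with made-up lists (the rows and labels below are purely illustrative):

toyTestSet = [[1.0, 'A'], [2.0, 'B'], [3.0, 'C']]
toyPredictions = ['A', 'B', 'A']
print(getAccuracy(toyTestSet, toyPredictions))   # 2 of 3 correct -> 66.66...%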

5. Testing our model

Here comes the important step which is “Testing our model”. We test our model on the test data and find the accuracy of our model.

predictions=[]
k = 3
for x in range(len(testSet)):
    neighbors = getNeighbors(trainingSet, testSet[x], k)
    result = getResponse(neighbors)
    predictions.append(result)
    print('Predicted is ' + repr(result) + ' Actual is ' + repr(testSet[x][-1]))
accuracy = getAccuracy(testSet, predictions)
print('Accuracy: ' + repr(accuracy) + '%')

We have chosen the value of “k” as 3, and for each test point in the test data set we find its nearest neighbors and vote on them. We store all our predictions in the “predictions” list and finally check the accuracy of our model!

And that’s it, we have coded up the kNN algorithm from scratch!! Isn’t it easy and fun?
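If you want to compare against a library implementation, scikit-learn’s KNeighborsClassifier builds the same kind of model; a minimal sketch, assuming the trainingSet and testSet lists built above:

from sklearn.neighbors import KNeighborsClassifier

X_train = [row[:-1] for row in trainingSet]   # the 4 feature columns
y_train = [row[-1] for row in trainingSet]    # the class label
X_test = [row[:-1] for row in testSet]
y_test = [row[-1] for row in testSet]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test) * 100)   # accuracy in percent

Since it runs on the same train/test split, the accuracy should closely match our from-scratch result (small differences are possible because of tie-breaking).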

Problems of kNN

So far, we have seen how kNN works and where it can be used, but what are its problems? In short, it has 2 major problems.

The first one is “Large computation time”. Since kNN does nothing at training time, all the work happens at test time: for every prediction we must compute the distance to every training point. That cost also grows with the number of features/dimensions/columns in the dataset, so with many features (and many training points) prediction becomes time-consuming.

The second one is “Large model size”. Since kNN does nothing at training time, our model is the data itself (data = model in kNN). If our data is huge, then our model is also huge, which is a problem.

I hope you have gained some knowledge of kNN. If you liked the content, please hit the “Clap” button.

Connect with me -

LinkedIn : https://linkedin.com/in/bomma-pranay
GitHub : https://github.com/Bomma-Pranay

--- By Bomma Pranay, a Data Science Enthusiast