KNNs (K-Nearest-Neighbours) in Python

The Nearest Neighbours algorithm is built around a search problem first posed in the technical literature by Donald Knuth (the "post-office problem"). The key idea is to find out which class a new point in the search space belongs to, whether the setting is binary classification, multi-class classification, regression (continuous output), unsupervised, or semi-supervised learning. Sounds mathematical? Let's make it simple.

Imagine you are a shopkeeper who sells online, and you are trying to group your customers so that the products recommended for them appear on the same page. Thus, a customer in India who buys a laptop will probably also buy a mouse, a mouse-pad, speakers, laptop sleeves, laptop bags, and so on. So you are trying to group this customer into a category, a class. How do you do this if you have millions of customers and over 100,000 products? Manual programming would not be the way to go. Here, the nearest neighbours method comes to the rescue.

You can group your customers into classes (e.g. Laptop-Buyer, Gaming-Buyer, New-Mother, Children~10-years-old) and, based upon what other people in those classes have bought in the past, you can choose to show them the items they are most likely to buy next, making their online shopping experience much easier and much more streamlined. How will you choose that? By grouping your customers into classes, and when a new customer comes in, deciding which class he belongs to and showing him the products relevant for his class.

This is the essence of the ML algorithm that platforms such as Amazon and Flipkart use for every customer. Their algorithms are much more complex, but this is their essence.

The Nearest Neighbours topic can be divided into the following sub-topics:

  1. Brute-Force Search

  2. KD-Trees

  3. Ball-Trees

  4. K-Nearest Neighbours

Out of all of these, K-Nearest Neighbours (usually referred to as KNN) is by far the most commonly used.

K-Nearest Neighbours (KNNs)

A KNN algorithm is very simple, yet it can be used for some very complex applications and arcane dataset distributions. It can be used for binary classification, multi-class classification, regression, clustering, and even for creating new algorithms used in state-of-the-art research (e.g. https://www.hindawi.com/journals/aans/2010/597373/ – a research paper on a fusion of KNNs and SVMs). Here, we will describe an application of KNN to binary classification, on an extremely interesting dataset from the UCI Repository (sonar, mines vs. rocks).

Implementation

The algorithm of a KNN ML model is given below:

[Figure: the K-Nearest Neighbours algorithm]

Again, mathematical! Let’s break it into small steps one at a time:

How the Algorithm Works

This explanation is for supervised learning binary classification.

Here we have two classes. We’ll call them A and B.

So the dataset is a collection of values which belong either to class A or class B.

A visual plot of the (arbitrary) data might look something like this:

Now, look at the star data point in the centre. To which class does it belong? A or B?

The answer? It depends on the hyperparameters we use. In the above diagram, k is a hyperparameter.

Hyperparameters significantly affect the output of a machine learning (ML) algorithm, which is why they need to be correctly tuned (set to the right values).

The algorithm then finds the k points closest to the new point. The output is shown above for k = 3 and for k = 6 (k being the number of nearest neighbouring points used to decide which class the new point belongs to).

Finally, we return as output the class most common among those k nearest neighbours, nearness being defined by a distance measure. The measures used include Euclidean distance, among others.

This is how the K Nearest Neighbours algorithm works in principle. As you can see, visualizing the data is a big help to get an intuitive picture of what the k values should be.
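
To make this concrete, here is a minimal from-scratch sketch of the idea in Python (a toy illustration with made-up points, not the scikit-learn implementation we will use later): compute the distance from the query point to every training point, take the k closest, and return the majority class among them.

import numpy as np
from collections import Counter

def knn_predict(points, labels, query, k=3):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(points - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those k neighbours
    return Counter(labels[nearest]).most_common(1)[0][0]

# Toy data: class A points near the origin, class B points near (5, 5)
points = np.array([[0.0, 0.5], [1.0, 1.0], [0.5, 0.0],
                   [5.0, 5.0], [5.5, 4.5], [4.5, 5.5]])
labels = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(points, labels, np.array([0.8, 0.6])))  # -> A
print(knn_predict(points, labels, np.array([4.9, 5.2])))  # -> B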

Now, let’s see the K-Nearest-Neighbours Algorithm work in practice.

Note: This algorithm is powerful and highly versatile. It can be used for binary classification, multi-class classification, regression, clustering, and so on. Many use-cases exist for this algorithm, which is quite simple but remarkably powerful, so make sure you learn it well so that you can use it in your own projects.
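
As one small illustration of that versatility, here is a minimal sketch (on made-up data) of KNN used for regression with scikit-learn's KNeighborsRegressor, where the prediction is the average of the targets of the k nearest neighbours rather than a majority vote:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up 1-D regression data: y is a noisy sine of x
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

# The prediction for a new x is the mean target of its 5 nearest neighbours
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(X, y)

print(reg.predict([[2.5]]))  # roughly sin(2.5), i.e. about 0.6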

Obtain the Data and Preprocess it

We shall use the data from the UCI Repository, available at the following link:

http://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks) 

It needs to be converted into a CSV file; a ready-made CSV version is available at the following link:

https://github.com/selva86/datasets/blob/master/Sonar.csv

This data is a set of 207 underwater sonar readings that have to be classified as rocks or underwater mines. Save the CSV file in the same directory as your Python source file and perform the following operations:

Import the required packages first:

import numpy as np
import pandas as pd
import scipy as sp

from datetime import datetime
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Read the CSV dataset into your Python environment and check out the top 5 rows using the head() Pandas DataFrame function.

> df = pd.read_csv("sonar.all-data.csv")
> df.head()

   0.0200  0.0371  0.0428  0.0207  0.0954  ...  0.0180  0.0084  0.0090  0.0032  R
0  0.0453  0.0523  0.0843  0.0689  0.1183  ...  0.0140  0.0049  0.0052  0.0044  R
1  0.0262  0.0582  0.1099  0.1083  0.0974  ...  0.0316  0.0164  0.0095  0.0078  R
2  0.0100  0.0171  0.0623  0.0205  0.0205  ...  0.0050  0.0044  0.0040  0.0117  R
3  0.0762  0.0666  0.0481  0.0394  0.0590  ...  0.0072  0.0048  0.0107  0.0094  R
4  0.0286  0.0453  0.0277  0.0174  0.0384  ...  0.0057  0.0027  0.0051  0.0062

Sonar Reading for Classification ML Problem

Check how much data you have and what its dimensions are:

> df.describe()

           0.0200      0.0371     ...      0.0090      0.0032
count  207.000000  207.000000     ...  207.000000  207.000000
mean     0.029208    0.038443     ...    0.007936    0.006523
std      0.023038    0.033040     ...    0.006196    0.005038
min      0.001500    0.000600     ...    0.000100    0.000600
25%      0.013300    0.016400     ...    0.003650    0.003100
50%      0.022800    0.030800     ...    0.006300    0.005300
75%      0.035800    0.048100     ...    0.010350    0.008550
max      0.137100    0.233900     ...    0.036400    0.043900
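
The describe() output summarises the feature columns. If you also want the raw dimensions as a (rows, columns) tuple, a quick shape check (not shown in the original output) does the job:

# DataFrame dimensions; with this CSV they should line up with the counts above
# (207 rows and 61 columns, the last column being the label)
df.shape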

Now, the last column is a letter. We need to encode it into a numerical value. For this, we can use LabelEncoder, as below:

# Inputs (data values): sonar readings from an underwater submarine. Cool!
X = df.values[:, 0:-1].astype(float)

# Convert classes M (Mine) and R (Rock) to numbers, since they're categorical values
le = LabelEncoder()

# Classification target
target = df.R

# Do conversion
le.fit(["R", "M"])
y = le.transform(target)

Now have a look at your target dataset. R (rock) and M (mine) have been converted into 1 and 0.

> y
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0])

Examine your unscaled input NumPy array, X:

> X
array([[0.0453, 0.0523, 0.0843, ..., 0.0049, 0.0052, 0.0044],
       [0.0262, 0.0582, 0.1099, ..., 0.0164, 0.0095, 0.0078],
       [0.01  , 0.0171, 0.0623, ..., 0.0044, 0.004 , 0.0117],
       ...,
       [0.0522, 0.0437, 0.018 , ..., 0.0138, 0.0077, 0.0031],
       [0.0303, 0.0353, 0.049 , ..., 0.0079, 0.0036, 0.0048],
       [0.026 , 0.0363, 0.0136, ..., 0.0036, 0.0061, 0.0115]])

Execute the train_test_split partition function. This splits the inputs into 4 separate NumPy arrays. We can control how the input data is split using the test_size or train_size parameters. Here the test_size parameter is set to 0.3: 30% of the data goes into the test set and the remaining 70% (the complement) into the training set. We train (fit) the ML model on the training arrays and then see how accurate our model is on the test set. By default, test_size is set to 0.25 (a 25% / 75% split). Normally this sampling is randomized, so different results appear each time the code is run. Setting random_state to a fixed value (any fixed value) makes sure that the same split is obtained every time we execute the model.

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Fit the KNN classifier to the dataset.

# Train k-nearest neighbours classifier
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=1)

# Fit the model
clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=1,
                     weights='uniform')

For now, it is all right (at this level) to leave the defaults as they are. The output of the KNeighborsClassifier has two values that you do need to know: metric and p. Right now we just need the Manhattan distance, specified by p = 1 and metric = "minkowski", so we'll go with that. The Manhattan distance is the distance between two points measured along axes at right angles: in a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|. (Source: https://xlinux.nist.gov/dads/HTML/manhattanDistance.html)
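
As a quick sanity check on that formula, here is a tiny snippet (with made-up example points) that computes the Manhattan distance by hand and compares it with SciPy's cityblock function:

from scipy.spatial.distance import cityblock

p1 = (1.0, 2.0)
p2 = (4.0, 6.0)

# |x1 - x2| + |y1 - y2| = |1 - 4| + |2 - 6| = 3 + 4 = 7
manual = abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])

print(manual)             # 7.0
print(cityblock(p1, p2))  # 7.0, the same value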

Output the statistical scoring of this classification model.

predicted = clf.predict(X_test)

print("Accuracy:")
print(accuracy_score(y_test, predicted))

print("Confusion Matrix:")
print(confusion_matrix(y_test, predicted))

print("Classification Report:")
print(classification_report(y_test, predicted))

Accuracy:
0.8253968253968254

Confusion Matrix:
[[32  3]
 [ 8 20]]

Classification Report:
             precision    recall  f1-score   support

          0       0.80      0.91      0.85        35
          1       0.87      0.71      0.78        28

avg / total       0.83      0.83      0.82        63

Accuracy on the test set was 82%. Not bad, since my first implementation used a random forest classifier and its top score was just 72%!
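
If you want to squeeze out a little more accuracy, a natural next step is to tune the hyperparameters discussed earlier (k, p, and the weighting scheme) rather than accepting fixed values. Here is a minimal sketch using scikit-learn's GridSearchCV; the best parameters and scores it reports will depend on the data and the split:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Candidate hyperparameter values, evaluated with 5-fold cross-validation
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "p": [1, 2],                      # 1 = Manhattan, 2 = Euclidean
    "weights": ["uniform", "distance"],
}

grid = GridSearchCV(KNeighborsClassifier(metric="minkowski"),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Cross-validated accuracy:", grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))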

The entire program is available as Python source code in a downloadable sonar-classification.txt file (rename *.txt to *.py and you're good to go).

To learn more about how k-nearest neighbours are used in practice, do check out the following excellent article on our blog:

https://dimensionless.in/spam-detection-with-natural-language-processing-part-3/ 

The following article is also an excellent reference for KNNs:

https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm 

Takeaways

K-Nearest-Neighbours is a powerful algorithm to have in your machine learning classification arsenal. It is used so frequently that it is often the first baseline model practitioners reach for. Use it, learn it in depth, and it will be incredibly useful to you throughout your data science career. I highly recommend the Wikipedia article since it covers nearly all applications of KNNs and much more.

Finally, Understand the Power of Machine Learning

Imagine trying to classify these sonar readings, each with 60 features, without a machine learning environment. You would have to load 207 x 61 ~ 12,600 values and then develop an algorithm by hand to analyse the data!

Scikit-Learn, TensorFlow, Keras, PyTorch, and AutoKeras bring such fantastic abilities to computers, letting us tackle problems that simply could not be solved before ML came along.

And this is just the beginning!

Automation is the Future

As ML and AI applications take root in our world more and more, many human tasks will be taken over by 'intelligent' software programs that perform operations like a human. We already have chatbots in many companies. Self-driving cars daily push the very limits of what we think a machine can do. The only question is: will you be on the side that is being replaced, or will you be on the new forefront of technological progress? Get reskilled. Or just start learning! Today, as soon as you can.
