What is K-Nearest Neighbours in Machine Learning?
Introduction
Supervised machine learning has two primary tasks – regression and classification – and the k-nearest neighbours algorithm can be used for both. It measures the similarity between two data points in an intuitive way: by measuring the distance between them. When an input is given to the k-nearest neighbours algorithm, it chooses the k items closest to that input and looks at the classes they belong to. The class that occurs most often among those k items – the one with the highest probability – is assigned as the label of the input. So while in classification problems the prediction is the most frequently occurring class among the neighbours, in regression the output is the average of the values of the k items selected by the algorithm.
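To make the classification-versus-regression distinction concrete, here is a minimal sketch, assuming scikit-learn is installed; the one-feature toy data is made up purely for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[1], [2], [3], [10], [11], [12]]           # a single made-up feature

# Classification: the most frequent class among the k neighbours wins
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, ["low", "low", "low", "high", "high", "high"])
print(clf.predict([[2.5]]))                     # -> ['low']

# Regression: the prediction is the average of the k neighbours' targets
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, [1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
print(reg.predict([[2.5]]))                     # -> [2.0], the mean of 1.0, 2.0 and 3.0
```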
In this article, we will understand the intuition behind the k-nearest neighbours algorithm, how it works, the different ways it is implemented, and its hyperparameters. Following the theoretical discussion, the pros and cons of the algorithm are summarized, enabling readers to decide whether it suits their needs. Readers are encouraged to look up an implementation of KNN in their preferred language – Python or R – and build a model around a real-life use case to deepen their understanding. The k-nearest neighbours algorithm is not only one of the easiest algorithms to understand but also one of the easiest to implement, since only two hyperparameters need to be tuned.
Understanding Intuition Behind K-Nearest Neighbours Algorithm
Let’s take an example with a set of images. We train the model on images of Team No. 7 from Naruto, containing different images of Naruto, Sasuke, Sakura, and Kakashi sensei. The model learns the labels of these four classes from the training dataset. When a new image of Naruto is fed to the algorithm, it looks for the training images most similar to this test image. Suppose the model finds the k = 5 images closest to the test image, and their classes are Naruto, Sasuke, Naruto, Naruto, and Kakashi. The probability of the image being Naruto is then 3/5 or 0.6, the probability of it being Sasuke is 1/5 or 0.2, and the probability of it being Kakashi is 1/5 or 0.2. The test image is therefore labelled Naruto, since that class has the highest probability. The KNN model figures this out using “distance metrics,” which are discussed in detail later in this article; essentially, the similarity between the features of the test image and the training images is what drives the probabilities and the final output. A minimal code sketch of this idea follows.
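Here is that intuition as a small sketch, assuming scikit-learn is installed. The two-dimensional feature vectors are made up and simply stand in for image embeddings of the four characters.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up feature vectors standing in for image embeddings
X_train = [
    [0.90, 0.10], [0.85, 0.15], [0.80, 0.20],   # Naruto
    [0.10, 0.90], [0.20, 0.80],                 # Sasuke
    [0.50, 0.95],                               # Sakura
    [0.30, 0.40], [0.35, 0.45],                 # Kakashi
]
y_train = ["Naruto", "Naruto", "Naruto",
           "Sasuke", "Sasuke",
           "Sakura",
           "Kakashi", "Kakashi"]

knn = KNeighborsClassifier(n_neighbors=5)       # k = 5, as in the example
knn.fit(X_train, y_train)

test_image = [[0.70, 0.25]]                     # a new, unseen "image"
print(knn.predict(test_image))                  # the most probable class
print(knn.predict_proba(test_image))            # per-class fractions of the 5 neighbours
```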
The KNN algorithm is valuable for tasks where we want to assign a category to new data based on previously seen data. It is a simple mathematical model that can nonetheless offer accurate predictions.
Working of the K-Nearest Neighbours Algorithm
The first and foremost step when working with the KNN algorithm is to choose the number “k.” Next, a distance metric must be chosen: will the model use the Euclidean distance or the Manhattan distance to calculate the similarity between features? Based on the chosen metric and value of k, the model then finds the k data points closest to the data point in question. The category that occurs most often among those k data points is the category assigned to the test data point. The sketch below walks through these steps from scratch.
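This is a rough from-scratch sketch of the steps above; function names such as `knn_predict` are our own for illustration, not part of any library.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Square root of the sum of squared feature differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(X_train, y_train, query, k=5, metric=euclidean):
    # 1. Distance from the query point to every training point
    distances = [(metric(x, query), label) for x, label in zip(X_train, y_train)]
    # 2. Labels of the k closest training points
    k_nearest = [label for _, label in sorted(distances, key=lambda d: d[0])[:k]]
    # 3. Majority vote: the most frequent label among the k neighbours wins
    return Counter(k_nearest).most_common(1)[0][0]

print(knn_predict([[0, 0], [1, 1], [5, 5], [6, 6]],
                  ["A", "A", "B", "B"],
                  query=[0.5, 0.5], k=3))       # -> 'A'
```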
Readers may now be wondering how the value of k is decided. Is there a way to find the most suitable value of k? Choosing k is a trial-and-error process: machine learning developers generally start with an arbitrary value of k and keep changing it until they reach an acceptable accuracy score. Rather than simply noting each value of k and its corresponding accuracy or loss, it is better to plot accuracy against k to visualize how the model performs as k varies. Note that k should not be kept too low, as a very small k produces unstable decision boundaries and more errors. The plot will reveal the value of k that gives maximum accuracy and minimum loss, as the sketch below illustrates.
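Here is a hedged sketch of that search, assuming scikit-learn and matplotlib are available; the iris dataset is used only as a stand-in for your own data.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k_values = list(range(1, 31))
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in k_values]

# Plot accuracy against k to see where the model performs best
plt.plot(k_values, scores, marker="o")
plt.xlabel("k (number of neighbours)")
plt.ylabel("Cross-validated accuracy")
plt.show()

best_k = k_values[scores.index(max(scores))]
print("Best k found:", best_k)
```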
After we have settled on the value of k, the next step is to choose the distance metric – the formula that defines the distance between the features of the test data point and the existing data points. Here are three common ways to calculate it (a short code sketch follows the list):
- Euclidean distance – the square root of the sum of squared differences between corresponding feature values.
- Manhattan distance – the sum of the absolute differences between corresponding feature values.
- Hamming distance – for each feature, the distance is 0 if the two values are the same and 1 if they differ; these per-feature results are summed.
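As a quick reference, here is a small NumPy sketch of these three metrics (NumPy is assumed to be installed); each function takes two feature vectors of equal length.

```python
import numpy as np

def euclidean_distance(a, b):
    # Square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan_distance(a, b):
    # Sum of absolute differences
    return np.sum(np.abs(a - b))

def hamming_distance(a, b):
    # Number of positions where the two vectors disagree
    return np.sum(a != b)

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 5.0])
print(euclidean_distance(a, b))   # ~2.236
print(manhattan_distance(a, b))   # 3.0
print(hamming_distance(a, b))     # 2
```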
The method we have discussed so far is the brute-force approach to the k nearest neighbours problem: a straightforward mathematical approach in which the probability of each category among the neighbours is calculated, and the most likely category becomes the label of the data point being predicted.
The problem with this approach is that we traverse all data points linearly to determine the probability of each class. This does not matter for a small dataset, but for datasets with hundreds of thousands of data points, a single test query can take a significant amount of time, which becomes a primary concern. To tackle this issue, a binary-search-like structure is used: we build a k-dimensional tree (k-d tree), which is essentially a binary tree with the data points arranged hierarchically by their feature values. The closeness of the test data point to the stored data points is determined by traversing the branches of this tree, which reduces the search time.
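A minimal sketch of a k-d tree query with scikit-learn follows; the random data is just a placeholder for a large training set.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.random((100_000, 3))     # 100,000 points in 3 dimensions

tree = KDTree(X_train)                 # build the tree once
query = rng.random((1, 3))
dist, idx = tree.query(query, k=5)     # the 5 nearest training points
print(idx)                             # indices of those points
print(dist)                            # their distances to the query
```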
In continuation with the k-dimensional tree approach, another popular method is to build clusters of data points that are similar or belong to the same region of the feature space. This approach is called building a ball tree. Similar data points are clustered together, and each cluster is represented by its centroid. Whenever a new data point has to be labeled, the distance metric is first applied between the test data point and the centroids of the clusters, so only the most promising clusters need to be searched in full.
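In scikit-learn, the same classifier can be told which of these search structures to use through its `algorithm` parameter; this is a sketch with placeholder data.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=5_000, n_features=10, random_state=0)

# 'brute', 'kd_tree', and 'ball_tree' select the neighbour-search strategy;
# the predictions are the same, only the lookup speed differs.
knn = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree")
knn.fit(X, y)
print(knn.predict(X[:3]))
```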
Applications of K Nearest Neighbours Algorithm
Because the k nearest neighbours algorithm can determine the class of a data point based on previously seen data, it has several practical use cases. Some of them are mentioned below:
- A common problem during pre-processing is that engineers and analysts find missing values in their datasets, which causes their models to perform poorly. Several techniques can address this, such as replacing a missing value with 0 or with the column average, but averaging does not work for categorical variables. KNN can be deployed here: the missing value of a row is predicted from the rows most similar to it (see the imputation sketch after this list).
- One specific form of unstructured data is clickstream data, which records users’ behavior as they visit websites. This data is fed to recommendation engines that model that behavior and predict future activity, which helps them recommend products and advertisements. For small-scale recommendation engines, the k nearest neighbours algorithm is a promising choice.
- In healthcare, the KNN algorithm can also be used to classify whether a person has a particular disease, which can be very useful in predicting chronic illnesses and saving thousands of lives.
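For the missing-value use case in the first item above, here is a hedged sketch using scikit-learn’s KNNImputer on a made-up numeric matrix; for categorical variables the idea is the same, except the most common class among the neighbours would be used instead of their mean.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that feature over the
# k nearest rows (k = 2 here), measured on the features that are present.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```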
Conclusion
The k-nearest neighbours algorithm is one of the fundamental algorithms that machine learning engineers use. It is instrumental for quick categorization of data points, and since the approach is mostly simple mathematics, it is easy for students to understand. The training samples are stored in the algorithm’s memory, so the model can readily accommodate new instances of data and act on them while keeping accuracy high and the loss rate low. Many machine learning algorithms require developers and engineers to fine-tune a large number of parameters to improve efficiency and decrease the loss rate; with KNN, only two hyperparameters must be tuned – the value of k and the distance metric. That said, because the algorithm keeps all previous data points, it consumes a significant amount of memory and runs slower as datasets grow larger, so scalability poses an issue. KNN may also not work well without optimization techniques for high-dimensional data such as images, since a large number of dimensions adds complexity and cost to the distance computations.
Read more about learning paths at codedamn here.
Happy Learning!