
The KNN Algorithm – Explanation, Opportunities, Limitations

23rd April, 2025

The K-Nearest Neighbor (KNN) algorithm is a versatile machine learning algorithm widely applied in fields like handwriting detection, image recognition, and video recognition. KNN is especially valuable when labeled data is scarce or costly to obtain, often achieving high accuracy across various prediction problems.

KNN works by approximating the target function locally rather than fitting a global model. The algorithm identifies the “neighborhood” of a new input (e.g., a new data point) by measuring its distance to the known data points, then uses the most relevant of those points to predict the unknown value, relying on the assumption that the most “informative” neighbors are the closest ones.

In this article, we’ll dive into the core concepts behind KNN and examine a real-world application to understand how it performs in practice.

The lazy learning paradigm and the KNN algorithm

KNN has no formal training phase, in contrast with eager learning methods that build a model from the training data before predicting on new data. Instead, KNN generates predictions on demand by comparing a new observation with the stored data using a distance metric.

Though this “lazy learning” approach may initially seem less reliable, KNN is highly effective and widely trusted in applications such as:

  • Computer Vision: KNN is often used for image classification, grouping images based on similarity.
  • Content Recommendation: Commonly applied in recommendation engines, KNN remains relevant for suggesting content, even with the advent of more advanced systems.

The curse of dimensionality

KNN works best with a low number of features due to the “curse of dimensionality.” As the number of features grows, KNN requires increasingly large amounts of data, and it becomes harder to distinguish genuinely similar data points from noise, which can lead to overfitting. Studies, such as Gu and Shao (2014), demonstrate that KNN performs better in lower-dimensional spaces.

The inner workings of KNN

Thankfully, the KNN algorithm is straightforward to understand. For a new observation, KNN identifies the K closest data points according to the chosen distance metric and derives its prediction from those neighbors.

In practice:

  • For regression tasks, KNN predicts based on the mean or median of the K nearest neighbors.
  • For classification tasks, KNN predicts by selecting the most common class (mode) among the K nearest points.

A closer look at the structure of KNN

Let’s say we have:

  • A dataset D,
  • A specified distance metric to measure the proximity between observations, 
  • An integer K representing the number of nearest neighbors to consider.

To predict the output y for a new observation X, we follow these steps:

  1. Calculate Distances: Measure the distance between X and each point in the dataset D.
  2. Select the Nearest Neighbors: Identify the K data points closest to X.
  3. Generate Prediction: 
    • For regression tasks, take the mean of the y values from the K nearest neighbors.
    • For classification tasks, use the mode (most common value) of the y values from the K nearest neighbors.

The final prediction is the value obtained in Step 3.

Below is the algorithm outlined in pseudo-code:

Pseudocode for the KNN Algorithm. This pseudocode demonstrates the nearest-neighbor approach for solving optimization problems, particularly in pathfinding scenarios. It takes an n×n distance matrix D and a starting index s as inputs, iteratively building a path by selecting the nearest unvisited node at each step. The process concludes by returning to the starting point, forming a complete path. | Source
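
To make the classification variant above concrete, here is a minimal Python sketch of the three steps (a toy implementation for illustration only, assuming NumPy arrays and Euclidean distance; in practice you would rely on a library implementation such as scikit-learn’s KNeighborsClassifier):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: compute the distance between x_new and every point in the dataset
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: select the indices of the K closest points
    nearest = np.argsort(distances)[:k]
    # Step 3: return the most common class among those neighbors
    # (for regression, you would return y_train[nearest].mean() instead)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage with two well-separated groups
X_train = np.array([[1.0, 2.0], [2.0, 1.5], [8.0, 9.0], [9.0, 8.5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([8.5, 9.2]), k=3))  # -> 1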

How distances and similarities in KNN work

KNN relies on distance metrics to assess how close two data points are, which essentially measures their similarity. The core assumption of KNN is that the closer two given points are to each other, the more related and similar they are.

Several common distance metrics are used to determine this similarity. Each metric has strengths suited to different data types, so taking the time to choose the right one for your data is important. Here are some notable distance metrics:

  • Euclidean distance: primarily used for quantitative data.
  • Gower distance: useful for heterogeneous data types.
  • Minkowski distance: applicable to real-valued vector spaces.
  • Jaccard distance: often used for binary data.
  • Hamming distance: ideal for categorical data, especially in data transmission or network settings.

Tip: Learn more about how distance metrics affect KNN classification results.

Most machine learning libraries provide these metrics, so you don’t need to implement them manually unless you want to explore their inner workings.
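
If you do want to experiment with them directly, SciPy exposes several of these metrics as plain functions. A quick sketch (the vectors below are made up for illustration; Gower distance is not part of SciPy and requires a separate package):

from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print(distance.euclidean(a, b))        # straight-line distance for quantitative data
print(distance.minkowski(a, b, p=3))   # generalized L_p distance on real-valued vectors
print(distance.hamming([1, 0, 1], [1, 1, 0]))  # fraction of differing positions (categorical data)
print(distance.jaccard([1, 0, 1], [1, 1, 0]))  # dissimilarity between binary vectors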

Choosing the best value for K

To find the optimal K value for your data, you’ll typically run the KNN algorithm several times with different K values and evaluate each one on held-out data. A K value for which the accuracy is both high and stable under small changes of K is usually a suitable choice.
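
In scikit-learn, a common way to automate this search is cross-validation over a grid of candidate K values. A generic sketch, assuming a feature matrix X and a label vector y:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Evaluate several candidate K values with 5-fold cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best K:", search.best_params_["n_neighbors"])
print("Cross-validated accuracy:", search.best_score_)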

When selecting K, consider that the feature count and group size are influential factors in the model’s performance. More features or more classes often require larger K values to capture meaningful patterns in the data.

For example:

  • Higher K values: Increasing K generally stabilizes predictions and improves resilience to outliers. A practical approach is incrementally increasing K until your chosen accuracy metric—like the F-Measure—meets an acceptable threshold.
  • K = 1: This makes predictions highly sensitive to noise and outliers, as each prediction relies on a single, possibly unreliable, neighbor.

When choosing the K value, it’s important to consider the distribution of samples across classes:

  • Increasing K: If one class has significantly more samples than others, increasing K helps balance predictions by averaging across more neighbors, reducing the impact of any single data point. This can lead to more stable predictions and prevent the model from being too influenced by outliers.
  • Decreasing K: When classes are evenly distributed or if certain classes have fewer samples, using a lower K allows the model to be more sensitive to closer, potentially relevant neighbors. This approach works well for datasets with a balanced class distribution but may be less effective for imbalanced datasets.

Here are some examples of varying the value of K for a specific dataset:

Clustering results with different K values. Increasing K results in a more granular segmentation that captures finer details (see the first row). However, as K approaches the number of observations (last scenario), the segmentation becomes less effective, leading to poor results and overfitting. | Source

As you can see, changing the number of neighbors changes the granularity of the segmentation. However, if we keep increasing K until it reaches N (the total number of data points), every prediction is based on the entire dataset, the local information KNN relies on is lost, and the model no longer generalizes well to unseen observations.

Practical use case: predicting breast cancer diagnosis with KNN

To put our understanding of KNN into practice, we’ll use the Wisconsin Breast Cancer dataset from the UCI Machine Learning Repository. It records features computed from digitized images of fine needle aspirates (FNA) of breast masses taken from clinical patients in Wisconsin, US, together with a target variable indicating whether each tumor is benign or malignant.

Characteristics of the Breast Cancer Wisconsin Data Set. The Breast Cancer Wisconsin Data Set is a multivariate data set of integer values used for classification tasks. It contains 10 different attributes and 699 data points, with some missing values. | Source

Step 1: Set up the project

After downloading the data from the link above, we’ll install all the required packages:

!pip install scikit-learn pandas plotly neptune

Now, import the necessary libraries and load the dataset:

import pandas as pd

# Load the dataset
data = pd.read_csv('breast-cancer-wisconsin.data', header=None)

# Display dataset information
data.info()
Information about our dataset | Source: Author

Add the dataset column names:

# Define the column names 
data.columns = ['Id', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']

Step 2: Visualize the data with Plotly

Our dataset exhibits a significant imbalance, where the number of benign cases greatly surpasses the malignant ones. This is common in medical datasets where certain conditions are inherently less frequent than others.

We’ll use Plotly to visualize this imbalance between benign and malignant records. Here’s how to do it:

import plotly.graph_objects as go

# Count the records in each class (2 = benign, 4 = malignant)
target_balance = data['Class'].value_counts()

# Create a bar chart of the class distribution
fig = go.Figure(data=[
    go.Bar(
        name='Target Balance',
        x=target_balance.index.map({2: 'Benign', 4: 'Malignant'}),
        y=target_balance.values
    )
])

# Show the plot
fig.show()
Plot of our class distribution showing an imbalance of the data | Source: Author

Another key metric is the mitosis level of clinical patients across both groups. Mitosis levels range from 1 (lowest) to 10 (highest), indicating the rate at which tumor cells divide. Typically, patients in the malignant group exhibit higher mitosis levels, reflecting more aggressive tumor growth.

# Record of Mitosis in the Benign and Malignant Groups
benign_patients = data[data['Class'] == 2]
malignant_patients = data[data['Class'] == 4]

mitoses_benign = benign_patients['Mitoses'].value_counts().sort_index()
mitoses_malignant = malignant_patients['Mitoses'].value_counts().sort_index()

# Grouping both results in a grouped bar chart
fig = go.Figure(data=[
    go.Bar(name='Benign Group Mitoses Levels', x=mitoses_benign.index, y=mitoses_benign),
    go.Bar(name='Malignant Group Mitoses Levels', x=mitoses_malignant.index, y=mitoses_malignant),
])
fig.update_layout(barmode='group', title='Mitoses Levels in Benign vs. Malignant Groups')
fig.show()

We can see the result below:

Level of mitosis in both clinical groups. The benign group is represented in blue, and the malignant group is plotted in red. | Source: Author

Step 3: Initialize your Neptune experiment

We’ll be using neptune.ai as our experiment tracker so we can clearly log our performance across many different values of K. Neptune is a versatile and highly scalable experiment tracker; you could even track months-long training runs and compare thousands of metrics in no time.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

First, we’ll create a virtual environment for project dependencies:

!conda create --name neptune python=3.6

Next, we’ll start logging. To get started, create a Neptune account and check out the Quickstart Guide, which includes information on how to connect your Google Colab environment to Neptune. Then, all you’ll have to do is create a project and copy your API key from the UI.

Now, initiate your experiment by configuring the necessary parameters:

import neptune

run = neptune.init_run(
    project="your-workspace-name/your-project-name",
    api_token="YourNeptuneApiToken",
)

run["Algorithm"] = "KNN"

# Hyperparameters of the KNN classifier (n_neighbors is filled in once we have chosen K)
params = {
    "algorithm": "auto",
    "leaf_size": 30,
    "metric": "minkowski",
    "metric_params": None,
    "n_jobs": None,
    "n_neighbors": None,
    "p": 2,
    "weights": "uniform",
}
run["parameters"] = params

Before training the KNN model, we need to preprocess the data. We drop the rows where Bare Nuclei contains the missing-value marker '?' and cast the Bare Nuclei column to integer type. After this stage, all attributes are of type int64 and there are no null values. We then separate the target variable from the features and split the data into training and test sets:

# Preprocessing: drop rows with missing values (marked '?') and cast Bare Nuclei to integer
data = data[data['Bare Nuclei'] != '?'].copy()
data['Bare Nuclei'] = data['Bare Nuclei'].astype('int64')

# Separating the features from the target
features = data.drop(columns=['Class', 'Id'])
target = data['Class']

# Splitting the data into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=123)

Step 4: Train the model and choose the best K value

In this step, we iterate over a range of K values to determine the best fit for our data. Understanding the influence of K is crucial, as it defines the boundaries between classes: each K value can draw these boundaries differently and therefore change the resulting classifications.

We’ll be logging each K iteration in Neptune, keeping track of the changes in the accuracy:
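
A minimal sketch of this tuning loop might look as follows (it reuses the run object and the train/test split from the previous steps; the K range and the tuning/accuracy field name are only suggestions):

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

for k in range(1, 7):
    knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    knn.fit(x_train, y_train)
    accuracy = metrics.accuracy_score(y_test, knn.predict(x_test))
    # Log the accuracy of each iteration as a series in Neptune
    run["tuning/accuracy"].append(accuracy)
    print(f"K = {k}: accuracy = {accuracy:.3f}")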

Accuracy changes tracked in Neptune

We observe the highest accuracy score, 0.992, at K = 6 (note that, due to zero-based indexing, K = 6 corresponds to index 5 on the plot). Other K values (2, 4, and 5) yield scores between 0.98 and 0.99. Given that multiple candidates achieve high scores, we select K = 5 based on the characteristics of the dataset.

We are utilizing the Minkowski distance for the KNN model, but experimenting with different distance metrics could suggest alternative K values.

With the final value of K = 5, we can train the model. Here is the setup for the KNN classifier with the selected parameters:

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# KNN Classifier configuration
knn = KNeighborsClassifier(n_neighbors=5, algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, p=2, weights='uniform')

# Training the model
knn.fit(x_train, y_train)

# Making predictions and evaluating accuracy
predictions = knn.predict(x_test)
accuracy = metrics.accuracy_score(y_test, predictions)

print("Final accuracy score:", accuracy)
Our final accuracy with K = 5 | Source: Author
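
To keep the experiment record complete, we can also log the final score to Neptune and close the run (the field name final_accuracy is just a suggestion):

# Log the final score and close the Neptune run
run["final_accuracy"] = accuracy
run.stop()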

KNN limitations

KNN is straightforward, requiring only the knowledge of the number of categories for classification. This simplicity allows it to seamlessly incorporate new categories without needing prior data on the existing ones. However, this feature also limits KNN in predicting rare occurrences, such as new diseases, because it lacks historical data to estimate their prevalence in a general population.

Despite its good accuracy on test sets, KNN is slow and resource-intensive at prediction time: it keeps the entire training dataset in memory, which becomes impractical for large datasets. In addition, its typical use of Euclidean distance makes it highly sensitive to feature scale, so features with larger magnitudes dominate the distance computation.
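
A common way to mitigate this scale sensitivity is to standardize the features before fitting the classifier, for example with a scikit-learn pipeline. A sketch, reusing the train/test split from the example above (this step was not part of the original workflow):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Standardize every feature to zero mean and unit variance before computing distances
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(x_train, y_train)
print("Accuracy with scaled features:", scaled_knn.score(x_test, y_test))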

Given these factors, KNN is less effective for datasets with high dimensionality due to its computational and memory demands.

Conclusion

In this article, we delved into how KNN, as a lazy learning algorithm, stores the entire dataset to make predictions. Unlike model-based algorithms, KNN generates predictions on the fly by assessing the similarity between an input observation and the existing dataset.

Thank you for reading!
