K-Nearest Neighbor

K-Nearest Neighbor is one of the simpler machine learning algorithms, and it's a good fit for classification, regression, and especially recommendation systems. The whole idea boils down to one sentence: tell me who your neighbors are, and I'll tell you who you are.

In practice, that means calculating the distance, usually Euclidean, sometimes Manhattan, between a query point and every other point in the dataset, then looking at the k closest ones to decide the answer.

What Kind of Problems Is KNN Suited For?

KNN works for both classification and regression problems, but in practice the industry leans heavily toward using it for classification. That said, the more important principle here isn't about KNN specifically: it's about machine learning in general. The real skill is understanding your data well enough to match it with the right algorithm. There's no hard technical barrier stopping you from applying any algorithm to any dataset, but that doesn't mean you'll get meaningful results. Fit matters.

One of KNN's defining characteristics is that it's a form of lazy learning, sometimes called instance-based learning. Most algorithms have a distinct training phase where they build an internal model from the data. KNN skips that entirely. It simply stores the dataset and defers all computation to the moment a prediction is actually requested: at that point, it scans through the stored data, finds the K nearest neighbors, and returns a result. In a sense, it never truly "learns" anything; it just remembers everything and works it out on demand.

This has a practical consequence that becomes significant at scale. Because all the computation happens at prediction time rather than upfront, large datasets sitting in memory can create serious performance bottlenecks. A brute-force prediction scans every stored data point, so the cost of each query grows linearly with the size of the dataset. There are ways to mitigate this, such as KD-trees, approximate nearest neighbor methods, and dimensionality reduction, but those are topics for another discussion.

In the sections that follow, we'll get into the mechanics of how KNN actually works, how to apply it to real problems, and how to implement it from scratch in code.

How Does KNN Work?

The intuition behind KNN can be summed up in a single phrase: "Tell me who your neighbors are, and I'll tell you who you are." The algorithm determines similarity between data points by calculating the distance between them, and that distance is what decides who counts as a neighbor.

This is where distance metrics come in. The two most commonly used are Euclidean distance and Manhattan distance.

Euclidean Distance

This is the straight-line distance between two points: the most natural way to measure distance in everyday space. For two points p and q with n features, the formula is:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

In simpler terms: subtract each feature value of one point from the corresponding feature value of the other, square the differences, sum them all up, and take the square root.
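
As a minimal sketch of that recipe, independent of the Ellipses code shown later, the steps translate almost word for word into C#. The class and method names here (EuclideanSketch, Distance) are hypothetical and chosen only for illustration.

```csharp
using System;

public static class EuclideanSketch
{
    // Straight-line distance between two points with the same number of features.
    public static double Distance(double[] p, double[] q)
    {
        if (p.Length != q.Length)
            throw new ArgumentException("Points must have the same number of features.");

        var sum = 0d;
        for (var i = 0; i < p.Length; i++)
        {
            var diff = q[i] - p[i];   // subtract corresponding feature values
            sum += diff * diff;       // square the difference and accumulate
        }
        return Math.Sqrt(sum);        // square root of the summed squares
    }
}
```

For the points (0, 0) and (3, 4), the squared differences are 9 and 16, their sum is 25, and the distance is 5.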

Manhattan Distance

Instead of a straight line, Manhattan distance measures distance as if you were navigating a city grid: only horizontal and vertical moves allowed, no diagonals. The formula is:

$$d(p, q) = \sum_{i=1}^{n} |q_i - p_i|$$

Here you simply sum the absolute differences between corresponding feature values, without squaring them.
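
A comparable sketch for Manhattan distance follows. It is again a standalone illustration with hypothetical names (ManhattanSketch, Distance), not the library's own implementation.

```csharp
using System;

public static class ManhattanSketch
{
    // City-block distance: the sum of absolute per-feature differences.
    public static double Distance(double[] p, double[] q)
    {
        if (p.Length != q.Length)
            throw new ArgumentException("Points must have the same number of features.");

        var sum = 0d;
        for (var i = 0; i < p.Length; i++)
            sum += Math.Abs(q[i] - p[i]);   // no squaring, no square root
        return sum;
    }
}
```

For the same points (0, 0) and (3, 4), the result is 3 + 4 = 7, compared with a Euclidean distance of 5.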

Beyond these two, there are several other distance metrics worth knowing: Cosine similarity, Chi-Square, and Minkowski distance among them. The right choice depends on the nature of your data and the problem you're solving.
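
Minkowski distance is worth a quick look because it generalizes the two metrics above: it raises each absolute difference to a power r, sums the results, and takes the r-th root, so r = 1 gives Manhattan distance and r = 2 gives Euclidean distance. The sketch below assumes r is at least 1; the names MinkowskiSketch, Distance, and order are hypothetical.

```csharp
using System;

public static class MinkowskiSketch
{
    // Minkowski distance of the given order (order = 1: Manhattan, order = 2: Euclidean).
    public static double Distance(double[] p, double[] q, double order)
    {
        if (p.Length != q.Length)
            throw new ArgumentException("Points must have the same number of features.");
        if (order < 1)
            throw new ArgumentOutOfRangeException(nameof(order), "Order must be at least 1.");

        var sum = 0d;
        for (var i = 0; i < p.Length; i++)
            sum += Math.Pow(Math.Abs(q[i] - p[i]), order);   // |difference|^order
        return Math.Pow(sum, 1d / order);                     // order-th root of the sum
    }
}
```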

Now, about that "K." In KNN, K simply refers to how many neighbors the algorithm considers when making a decision: think of it as the number of nearby data points that get a vote. A higher K means more neighbors are consulted, which tends to produce smoother, more stable predictions but can blur boundaries between categories. A lower K is more sensitive to local patterns but can be noisier.

One thing worth clarifying upfront: the role of K here is different from its role in K-Means Clustering. In K-Means, choosing the right K is a critical part of the algorithm: it defines how many clusters the model will produce, and getting it wrong meaningfully changes the outcome. In KNN, K is more of a tuning parameter. It influences performance, but it doesn't define the structure of the solution in the same fundamental way. We'll keep the K-Means discussion for its own article.

KNN in Action

Now that we've covered the mechanics, let's see how KNN actually behaves with real data. In the diagram below, you'll find a simple scatter plot with three groups of data points (blue squares, green circles, and red triangles), each representing a different class.

The idea is straightforward: when a new, unclassified data point appears on the plot, KNN looks at the K nearest points surrounding it and takes a vote. Whichever class shows up most among those neighbors is the class the new point gets assigned to.

This visual makes the algorithm's logic immediately clear: classification isn't based on any complex internal model or learned representation. It's purely geometric. Proximity determines identity, and the boundary between classes emerges naturally from where the data points sit relative to each other.

In the sections that follow, we'll walk through this step by step with concrete examples to show exactly how the neighbor selection and voting process plays out.

When a new data point enters the picture, represented here as an orange star, KNN gets to work immediately. Using whichever distance metric we've chosen, the algorithm calculates the distance between the orange star and every other data point in the dataset. It then ranks those distances and selects the K closest ones as the star's nearest neighbors.

At this stage, nothing has been decided yet. The algorithm has simply done the math and identified which data points are geometrically closest to our new point in the feature space. What those neighbors are (blue squares, green circles, or red triangles) is what determines the classification, and that's exactly what we'll look at next.

And that's really the essence of it. KNN is refreshingly transparent: there's no complex internal model being built, no abstract representations being learned. Just distance calculations, a count of the closest neighbors, and a majority vote. For an algorithm that shows up in so many real-world applications, the underlying logic is about as straightforward as machine learning gets.
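
To make those three steps concrete, here is a minimal brute-force classifier sketch. It is a standalone illustration, not the Ellipses implementation covered later: the names KnnVoteSketch and Classify, the tuple-based data shape, and the choice of Euclidean distance are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KnnVoteSketch
{
    // Classifies `query` by majority vote among its k nearest labeled points.
    public static string Classify(
        IReadOnlyList<(double[] Features, string Label)> points, double[] query, int k)
    {
        if (k < 1 || k > points.Count)
            throw new ArgumentOutOfRangeException(nameof(k));

        return points
            .OrderBy(p => Euclidean(p.Features, query))   // rank every stored point by distance
            .Take(k)                                       // keep the k closest
            .GroupBy(p => p.Label)                         // tally the votes per class
            .OrderByDescending(g => g.Count())             // stable sort: on a tie, the class
            .First().Key;                                  // holding the nearest neighbor wins
    }

    private static double Euclidean(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
}
```

With three "blue" points clustered around (1, 1) and three "red" points around (8, 8), a query at (2, 2) with k = 3 finds only blue neighbors and is labeled "blue". Note that every call sorts the full dataset, which is exactly the prediction-time cost described earlier.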

What makes it powerful isn't complexity: it's the fact that geometric proximity in feature space turns out to be a surprisingly reliable proxy for similarity in a wide range of problems. Keep the data clean, choose your distance metric thoughtfully, and tune your K, and you have a solid, interpretable baseline that's hard to beat for many classification tasks.

Let's Write the Code From Scratch

Before we dive in, I want to address something directly. If you search for KNN implementations online, the vast majority of articles and tutorials will reach for a machine learning library (scikit-learn, TensorFlow, or similar) and have the algorithm running in a handful of lines. There's nothing wrong with those libraries. They're well-built, well-tested, and absolutely the right choice in production.

But that's not what we're doing here.

My reason is simple: using a library without understanding what's happening underneath it will only get you so far. Standard datasets and textbook problems are forgiving: you can get good results by following a tutorial without truly understanding the mechanics. Real-world problems are not that forgiving. They're messier, more ambiguous, and they don't come with a clear instruction manual. The developers who can navigate those situations are the ones who understand what the algorithm is actually doing, not just how to call it. Writing it from scratch is, in my view, the most honest path to that understanding.

So that's what we're going to do.

Defining the Problem

Before writing a single line of code, we need a concrete problem to solve. Let's say we run a website that publishes technical articles, and we want to recommend similar articles to readers based on what they're currently reading: a simple but genuinely useful feature.

To use KNN here, we need data and we need features. Our data points are the articles themselves. The features that describe each article could be things like:

Category: the broad topic area
Subcategory: a more specific classification within that topic
Programming language: the language the article focuses on, if any
Development platform: the environment or framework the article relates to

With those features defined, each article becomes a point in a four-dimensional feature space, and KNN can find the nearest neighbors, the most similar articles, for any given piece of content.

Let's start building it.

Data and Dataset

Every machine learning solution starts in the same place: the data. Before writing a single line of algorithm code, you need to understand what your data looks like, what features it carries, and whether those features are actually meaningful for the problem you're trying to solve.

In our case, the dataset is a collection of articles published on a website. Each article is a data point, and the features that describe it are what KNN will use to measure similarity between articles. For this example, we'll work with four features:

Category: the broad topic the article falls under
Subcategory: a more specific classification within that category
Programming language: the language covered or used in the article
Development platform: the framework, environment, or platform the article relates to

Each article, when described by these four features, becomes a point in a four-dimensional space. Two articles that share the same category, subcategory, language, and platform will sit very close together in that space. Two articles with nothing in common will sit far apart. KNN uses those distances to decide which articles are similar enough to recommend.

With the dataset defined, we have everything we need to start writing the implementation.

Table Development Platform

Table Category

Table Subcategory

Table Languages

Table Articles

Implementation

Now that the data and problem are clearly defined, we're ready to build the algorithm. The code below is taken directly from one of my open source projects, a machine learning library called Ellipses, where I've implemented KNN and several other algorithms from scratch, without relying on any external ML dependencies.

The goal here isn't just to make it work. It's to write something readable enough that you can follow the logic step by step and map it back to everything we've covered so far: the distance calculations, the neighbor selection, the voting mechanism. By the time we're done, the implementation should feel less like code and more like a direct translation of the algorithm itself.

Let's get into it.

IKNearestNeighbors

The interface is the natural starting point. Before writing any implementation logic, defining the contract makes it clear exactly what the algorithm needs to do and what it exposes to the outside world.

We need two methods:

LoadDataSet handles loading the data into the algorithm, with an optional normalization parameter. Normalization is worth flagging here: when your features are on very different scales (say, one feature ranges from 0 to 1 and another from 0 to 10,000), distance calculations can become heavily skewed toward the larger-scale feature. Normalizing brings everything onto a common scale and ensures that no single feature dominates the distance metric unfairly.

GetNearestNeighbors is where the core work happens. It takes the feature data of a target point and a K value, scans through the loaded dataset, and returns the K closest matches. It also accepts two optional parameters:

substractOrigin: a flag that tells the algorithm not to include the data point you passed in as one of the results. This is important when the query point already exists in the dataset: without this, the algorithm would always return the point itself as its own nearest neighbor, which is technically correct but completely useless.
distanceCalculater: the distance function the algorithm uses to measure similarity between points. Euclidean distance is the default, but the interface is designed to accept any distance function, giving you the flexibility to swap in Manhattan, Cosine, or any other metric depending on what suits your data.

With the interface defined, we have a clear blueprint for the implementation that follows.

using System.Collections;

namespace Ellipses.Interfaces
{
    public interface IKNearestNeighbors
    {
        // Loads the models into the algorithm, optionally normalizing the feature values
        void LoadDataSet<T>(T[] models, bool normalization = false);

        // Returns the k entries of the loaded data set that are closest to y
        IList GetNearestNeighbors(double[] y, int k,
            bool substractOrigin = false, IDistanceCalculater distanceCalculater = null);
    }
}
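
The normalization option on LoadDataSet is easier to picture with a small example. The sketch below assumes min-max scaling, which maps each feature column onto the range 0 to 1; the Normalizer class in Ellipses is not shown in this article and may work differently. The names MinMaxSketch and Normalize are hypothetical.

```csharp
using System;
using System.Linq;

public static class MinMaxSketch
{
    // Rescales every feature column of `data` to the range [0, 1] and returns a new array.
    public static double[][] Normalize(double[][] data)
    {
        if (data.Length == 0) return data;

        var cols = data[0].Length;
        var min = Enumerable.Range(0, cols).Select(c => data.Min(r => r[c])).ToArray();
        var max = Enumerable.Range(0, cols).Select(c => data.Max(r => r[c])).ToArray();

        return data
            .Select(row => row
                // A constant column carries no information, so map it to 0 instead of dividing by zero.
                .Select((v, c) => max[c] == min[c] ? 0d : (v - min[c]) / (max[c] - min[c]))
                .ToArray())
            .ToArray();
    }
}
```

With one feature ranging from 0 to 1 and another from 0 to 10,000, the row (0.5, 5000) becomes (0.5, 0.5), so both features now contribute equally to the distance. Any query point has to be scaled with the same minimum and maximum values as the dataset, which is presumably why the implementation below calls NormalizeInput on the input when the data was normalized.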

NearestNeighbors

/* ========================================================================
 * Ellipses Machine Learning Library 1.0
 * https://www.ellipsesai.com
 * ========================================================================
 * Copyright Ali Gulum
 *
 * ========================================================================
 * Licensed under the Creative Commons Attribution-NonCommercial 4.0 International License;
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     https://creativecommons.org/licenses/by-nc/4.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 * ========================================================================
 */

using System.Collections;
using System.Collections.Generic;
using System.Linq;
using Ellipses.Calculaters;
using Ellipses.Helpers;
using Ellipses.Interfaces;
using Ellipses.Metrics;

namespace Ellipses.Algorithms
{
    /// ******************Description******************
    /// The NearestNeighbors class finds the nearest neighbors of a point in a data set built from
    /// the given models and their features. Models can be any custom object type. Fields that should
    /// be used as features must be marked with AiFieldAttribute; fields without it are ignored.
    /// Three distance calculaters are available: Euclidean, Manhattan and Minkowski. To use a different
    /// one, implement the IDistanceCalculater interface and pass the new distance calculater to the
    /// NearestNeighbors constructor or to the GetNearestNeighbors function.
    /// The default distance calculater is Euclidean.
    /// ***********************************************
    public class NearestNeighbors<TObjectType> : IKNearestNeighbors
    {
        //Helper class for converting models
        private readonly IConverter _converter;

        //Helper classes for calculating distance and normalizing data
        private readonly IDistanceCalculater _distanceCalculater;
        private readonly INormalizer _normalizer;

        //Flag if the data is normalized or not
        private bool _dataNormalized;

        //Data set as double array list
        private double[][] _dataSet;

        //Data set as matrix
        private Matrix _matrix;

        //Data set in its original shape, without conversion
        private IList _models;

        /// <summary>
        ///     NearestNeighbors
        /// </summary>
        /// <param name="distanceCalculater">Distance calculater</param>
        /// <param name="normalizer">Normalizer</param>
        /// <param name="converter">Converter for the models</param>
        public NearestNeighbors(IDistanceCalculater distanceCalculater = null, INormalizer normalizer = null,
            IConverter converter = null)
        {
            _distanceCalculater = distanceCalculater ?? new Euclidean();
            _normalizer = normalizer ?? new Normalizer();
            _converter = converter ?? new Converter();
        }

        /// <summary>
        ///     Load data set
        /// </summary>
        /// <param name="models">Data set</param>
        /// <param name="normalization">Normalize data</param>
        public void LoadDataSet<T>(T[] models, bool normalization = false)
        {
            _dataNormalized = normalization;
            _dataSet = _converter.ConvertModels(models);
            _models = models;
            if (normalization)
                _dataSet = _normalizer.Normalize(_dataSet);

            _matrix = new Matrix(_dataSet);
        }

        /// <summary>
        ///     Get Nearest Neighbors from the data set according to given data by y
        /// </summary>
        /// <param name="y">Data to check</param>
        /// <param name="k">Number of neighbors to return</param>
        /// <param name="substractOrigin">Exclude the query point itself from the results</param>
        /// <param name="distanceCalculater">Distance calculater</param>
        public IList GetNearestNeighbors(double[] y, int k, bool substractOrigin = false,
            IDistanceCalculater distanceCalculater = null)
        {
            var distanceHelper = distanceCalculater ?? _distanceCalculater;
            var dists = new Dictionary<TObjectType, double>();
            var data = _matrix;
            //Index of the last feature; the distance calculaters loop up to and including it
            var featureLength = y.Length - 1;

            //The query must be scaled the same way as the data set
            var normalizedY = y;
            if (_dataNormalized)
                normalizedY = _normalizer.NormalizeInput(y);

            //Brute force: measure the distance from the query to every row in the data set
            var inputVector = new Vector(normalizedY);
            for (var i = 0; i <= data.Rows - 1; i++)
            {
                var x = data[i];
                var distance = distanceHelper.CalculateDistance(x, inputVector, featureLength);
                dists.Add((TObjectType) _models[i], distance);
            }
            var sorted = dists.OrderBy(kp => kp.Value);

            //Skip(1) assumes the query point is in the data set, so it sorts first with distance 0
            return substractOrigin ? sorted.Skip(1).Take(k).ToArray() : sorted.Take(k).ToArray();
        }
    }
}

Euclidean Distance

using System;
using System.Collections.Generic;
using Ellipses.Interfaces;
using Ellipses.Metrics;

namespace Ellipses.Calculaters
{
    public class Euclidean : IDistanceCalculater
    {
        /// <summary>
        ///     Calculate distance with Euclidean Distance
        /// </summary>
        /// <param name="x">Data set row</param>
        /// <param name="y">Data to check</param>
        /// <param name="length">Features length (index of the last feature)</param>
        public double CalculateDistance(IReadOnlyList<double> x, IReadOnlyList<double> y, int length)
        {
            var distance = 0d;
            //length is the index of the last feature, so the bound is inclusive
            for (var j = 0; j <= length; j++)
            {
                //Skip features that are missing (NaN) on either side
                if (double.IsNaN((dynamic) x[j]) || double.IsNaN((dynamic) y[j])) continue;
                distance += Math.Pow(y[j] - x[j], 2);
            }
            return Math.Sqrt(distance);
        }

        /// <summary>
        ///     Calculate distance with Euclidean Distance
        /// </summary>
        /// <param name="vector">Data set row as vector</param>
        /// <param name="y">Vector to check</param>
        /// <param name="length">Features length (index of the last feature)</param>
        public double CalculateDistance(Vector vector, Vector y, int length)
        {
            var distance = 0d;
            var x = vector.ToArray();
            //length is the index of the last feature, so the bound is inclusive
            for (var j = 0; j <= length; j++)
            {
                //Skip features that are missing (NaN) on either side
                if (double.IsNaN((dynamic) x[j]) || double.IsNaN((dynamic) y[j])) continue;
                distance += Math.Pow(y[j] - x[j], 2);
            }
            return Math.Sqrt(distance);
        }
    }
}

The implementation itself is fairly self-explanatory at this point. It takes the models as input, converts them into a two-dimensional array, applies the chosen distance function across the feature values, and returns the K closest points ranked by distance. If you'd like to dig into the converter class that handles the data transformation, you can find the full source code on the Ellipses project page; I've kept it out of this article to stay focused on the algorithm itself.

With the implementation in place, let's put it to work on our article recommendation problem. As a reminder, here's the article dataset we defined earlier:

This means each article in our dataset is described by four features, Category, Subcategory, Programming Language, and Development Platform, and those four features are exactly what KNN will use to measure similarity between articles.

The first step in the implementation is defining the model. Each article gets represented as a class, and the fields that should be treated as features by the algorithm are marked with the AiField attribute. This is how the converter class knows which properties to include when building the feature array: anything tagged with AiField gets picked up, anything without it gets ignored.

In our case, that means everything except the article's Id. The Id is just an identifier: it carries no meaningful information about the content of the article and would only introduce noise into the distance calculations. Category, Subcategory, Programming Language, and Development Platform are the fields that actually describe what an article is about, so those are the ones we expose to the algorithm.

With the model defined, we have a clean, structured representation of each article that KNN can work with directly.

public class Article
{
    public int Id { get; set; }

    [AiField]
    public int DevPd { get; set; }

    [AiField]
    public int CategoryId { get; set; }

    [AiField]
    public int SubcategoryId { get; set; }

    [AiField]
    public int LanguageId { get; set; }
}

Now we need a dataset to actually run the algorithm against. In a real production system, this step would typically involve pulling records from a database: querying your articles table, mapping the results to your model, and feeding them into the algorithm. For the purposes of this example, we'll define a static list and populate it manually, which keeps things focused on the algorithm itself rather than the data access layer.

The list is straightforward: each entry is an instance of our article model, with its four feature fields filled in. Once we have a representative set of articles covering different categories, subcategories, languages, and platforms, we have everything KNN needs to start finding similarities.

In the next step, we'll load this list into our NearestNeighbors class and run our first query.

var articles = new List<Article>
{
    //K-NearestNeighbor
    new Article() {Id = 1, DevPd = 1, CategoryId = 1, SubcategoryId = 1, LanguageId = 1},
    //Support Vector Machine
    new Article() {Id = 2, DevPd = 1, CategoryId = 1, SubcategoryId = 2, LanguageId = 1},
    //Neural Network
    new Article() {Id = 3, DevPd = 2, CategoryId = 1, SubcategoryId = 3, LanguageId = 2},
    //Mobile Development Tips
    new Article() {Id = 4, DevPd = 2, CategoryId = 2, SubcategoryId = 4, LanguageId = 2},
    //Using SQLite
    new Article() {Id = 5, DevPd = 2, CategoryId = 3, SubcategoryId = 6, LanguageId = 2},
    //Using Tensorflow
    new Article() {Id = 6, DevPd = 1, CategoryId = 1, SubcategoryId = 3, LanguageId = 3}
};

With the dataset ready, the next step is straightforward. We instantiate our NearestNeighbors class and pass the article list into LoadDataSet. This is where the data gets converted into the two-dimensional feature array that the algorithm will work with internally: from this point on, each article is no longer a model object but a point in four-dimensional feature space.

Once the dataset is loaded, the class is ready to start answering queries. All the setup work is done. What comes next is the part we've been building toward: passing in a target article and asking the algorithm to find its nearest neighbors.

var nearestNeighbors = new NearestNeighbors<Article>();
nearestNeighbors.LoadDataSet(articles.ToArray());

And that's all it takes to get the model loaded and ready. From this point, the heavy lifting is done: the dataset is in memory, converted into feature arrays, and the algorithm is ready to start making predictions.

Now let's put it to use with a real scenario. Say a user is currently reading the K-Nearest Neighbor article: this very one, in fact. We want to surface a set of similar articles they might find interesting next.

To do that, we pass two things into GetNearestNeighbors: the features of the current article represented as a one-dimensional array (one value per feature), and a K value that tells the algorithm how many suggestions we want to return. If we want to show the user three recommended articles, K is 3. If we want five, K is 5.

The feature array we pass in describes the current article: its category, subcategory, programming language, and development platform. KNN will take those values, calculate the distance between this point and every other article in the dataset, and hand back the K closest matches. Those are your recommendations.

Let's see what that looks like in code.

// The feature values are listed in the same order as the [AiField] properties on Article.
var currentArticle = articles.First(x => x.Id == 1);
var articleToPass = new double[]
    {currentArticle.DevPd, currentArticle.CategoryId, currentArticle.SubcategoryId, currentArticle.LanguageId};

With K set to 2, we're asking the algorithm to find the two most similar articles to the one currently being read. The feature array we pass in represents the K-Nearest Neighbor article itself (its category, subcategory, language, and platform), and the substractOrigin flag ensures the article itself doesn't show up in its own recommendations.

The algorithm does the rest. It calculates the distance between our target article and every other article in the dataset, ranks them by proximity, and returns the two closest matches. Those two articles are what gets surfaced to the user as recommendations.

Simple as that: a handful of lines of code, no external ML library, and a fully working article recommendation feature built on a from-scratch KNN implementation.

var suggestedArticles = nearestNeighbors.GetNearestNeighbors(articleToPass, 2, true);
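
To check what that call should return, the same query can be reproduced without Ellipses in a few lines. This is a standalone sketch under stated assumptions: it hard-codes the six feature rows from the sample list, uses plain Euclidean distance with no normalization, and excludes the current article by Id rather than by skipping the first result. The names ArticleQuerySketch and Recommend are hypothetical.

```csharp
using System;
using System.Linq;

public static class ArticleQuerySketch
{
    // Feature rows in the order DevPd, CategoryId, SubcategoryId, LanguageId.
    private static readonly (int Id, double[] Features)[] Articles =
    {
        (1, new double[] {1, 1, 1, 1}),   // K-NearestNeighbor
        (2, new double[] {1, 1, 2, 1}),   // Support Vector Machine
        (3, new double[] {2, 1, 3, 2}),   // Neural Network
        (4, new double[] {2, 2, 4, 2}),   // Mobile Development Tips
        (5, new double[] {2, 3, 6, 2}),   // Using SQLite
        (6, new double[] {1, 1, 3, 3})    // Using Tensorflow
    };

    // Returns the Ids of the k articles closest to the article with originId, excluding that article.
    public static int[] Recommend(int originId, int k)
    {
        var origin = Articles.First(a => a.Id == originId).Features;
        return Articles
            .Where(a => a.Id != originId)
            .OrderBy(a => Math.Sqrt(a.Features.Zip(origin, (x, y) => (x - y) * (x - y)).Sum()))
            .Take(k)
            .Select(a => a.Id)
            .ToArray();
    }
}
```

Calling ArticleQuerySketch.Recommend(1, 2) yields the Ids 2 and 3. The Support Vector Machine article differs from the KNN article only in its subcategory, giving a distance of 1. The Neural Network article is at a distance of about 2.45 (the square root of 6), just ahead of the Tensorflow article at about 2.83 (the square root of 8). One caveat worth keeping in mind: these features are arbitrary integer Ids, so the distances depend on how the Ids happen to be numbered, not only on whether two articles share a value.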

And that's really the takeaway. KNN is one of those rare algorithms where the gap between the theory and the implementation is almost nonexistent. The logic is intuitive enough to explain in a single sentence, and the code maps directly onto that logic without any hidden complexity in between.

What makes it worth understanding deeply, beyond just calling a library, is that once you've built it from scratch, you develop a feel for where it works well and where it starts to struggle. You understand why the choice of distance metric matters, why K needs to be tuned thoughtfully, and why large datasets can become a bottleneck. That kind of understanding doesn't come from a three-line scikit-learn example.

Hopefully this walkthrough has made the algorithm feel approachable, and more importantly, usable: not just as a black box you import, but as a tool you genuinely understand.