Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster kNN algorithm in Python

I want to code my own kNN algorithm from scratch, the reason is that I need to weight the features. The problem is that my program is still really slow despite removing for loops and using built in numpy functionality.

Can anyone suggest a way to speed this up? I don't use np.sqrt for the L2 distance because it's unnecessary and actually slows it all up quite a bit.

class GlobalWeightedKNN:
    """
    A k-NN classifier with feature weights

    Returns: predictions of k-NN.
    """

    def __init__(self):
        self.X_train = None
        self.y_train = None
        self.k = None
        self.weights = None
        self.predictions = list()

    def fit(self, X_train, y_train, k, weights):        
        self.X_train = X_train
        self.y_train = y_train
        self.k = k
        self.weights = weights

    def predict(self, testing_data):
        """
        Takes a 2d array of query cases.

        Returns a list of predictions for k-NN classifier
        """

        np.fromiter((self.__helper(qc) for qc in testing_data), float)  
        return self.predictions


    def __helper(self, qc):
        neighbours = np.fromiter((self.__weighted_euclidean(qc, x) for x in self.X_train), float)
        neighbours = np.array([neighbours]).T 
        indexes = np.array([range(len(self.X_train))]).T
        neighbours = np.append(indexes, neighbours, axis=1)

        # Sort by second column - distances
        neighbours = neighbours[neighbours[:,1].argsort()]  
        k_cases = neighbours[ :self.k]
        indexes = [x[0] for x in k_cases]

        y_answers = [self.y_train[int(x)] for x in indexes]
        answer = max(set(y_answers), key=y_answers.count)  # get most common value
        self.predictions.append(answer)


    def __weighted_euclidean(self, qc, other):
        """
        Custom weighted euclidean distance

        returns: floating point number
        """

        return np.sum( ((qc - other)**2) * self.weights )
like image 855
Eoin Ó Coinnigh Avatar asked Aug 04 '18 18:08

Eoin Ó Coinnigh


People also ask

How do you get the best K in KNN in Python?

The optimal K value usually found is the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value. KNN performs well with multi-label classes, but you must be aware of the outliers.

What is fast KNN in machine learning?

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.

Why is my KNN taking so long?

KNN is just a slow algorithm, it's slower for you because computing distances between images is hard at scale, and it's slower for you because the problem is large enough that your cache can't really be used effectively. to at least use all of your cores.


2 Answers

you can take a look at this great article introducing faiss
Make kNN 300 times faster than Scikit-learn’s in 20 lines!
it is on GPU and developed in CPP behind the seen

import numpy as np
import faiss


class FaissKNeighbors:
    def __init__(self, k=5):
        self.index = None
        self.y = None
        self.k = k

    def fit(self, X, y):
        self.index = faiss.IndexFlatL2(X.shape[1])
        self.index.add(X.astype(np.float32))
        self.y = y

    def predict(self, X):
        distances, indices = self.index.search(X.astype(np.float32), k=self.k)
        votes = self.y[indices]
        predictions = np.array([np.argmax(np.bincount(x)) for x in votes])
        return predictions
like image 89
Amirhe Avatar answered Sep 24 '22 10:09

Amirhe


Scikit-learn uses a KD Tree or Ball Tree to compute nearest neighbors in O[N log(N)] time. Your algorithm is a direct approach that requires O[N^2] time, and also uses nested for-loops within Python generator expressions which will add significant computational overhead compared to optimized code.

If you'd like to compute weighted k-neighbors classification using a fast O[N log(N)] implementation, you can use sklearn.neighbors.KNeighborsClassifier with the weighted minkowski metric, setting p=2 (for euclidean distance) and setting w to your desired weights. For example:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(metric='wminkowski', p=2,
                             metric_params=dict(w=weights))
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
like image 20
jakevdp Avatar answered Sep 20 '22 10:09

jakevdp