I want to code my own kNN algorithm from scratch; the reason is that I need to weight the features. The problem is that my program is still really slow despite removing for loops and using built-in numpy functionality.
Can anyone suggest a way to speed this up? I don't use np.sqrt for the L2 distance because it's unnecessary and actually slows it all down quite a bit.
import numpy as np

class GlobalWeightedKNN:
    """
    A k-NN classifier with feature weights.
    Returns: predictions of k-NN.
    """

    def __init__(self):
        self.X_train = None
        self.y_train = None
        self.k = None
        self.weights = None
        self.predictions = list()

    def fit(self, X_train, y_train, k, weights):
        self.X_train = X_train
        self.y_train = y_train
        self.k = k
        self.weights = weights

    def predict(self, testing_data):
        """
        Takes a 2d array of query cases.
        Returns a list of predictions for the k-NN classifier.
        """
        np.fromiter((self.__helper(qc) for qc in testing_data), float)
        return self.predictions

    def __helper(self, qc):
        neighbours = np.fromiter((self.__weighted_euclidean(qc, x) for x in self.X_train), float)
        neighbours = np.array([neighbours]).T
        indexes = np.array([range(len(self.X_train))]).T
        neighbours = np.append(indexes, neighbours, axis=1)
        # Sort by second column - distances
        neighbours = neighbours[neighbours[:, 1].argsort()]
        k_cases = neighbours[:self.k]
        indexes = [x[0] for x in k_cases]
        y_answers = [self.y_train[int(x)] for x in indexes]
        answer = max(set(y_answers), key=y_answers.count)  # get most common value
        self.predictions.append(answer)
        return answer  # returned so np.fromiter in predict() receives a numeric value

    def __weighted_euclidean(self, qc, other):
        """
        Custom weighted euclidean distance
        returns: floating point number
        """
        return np.sum(((qc - other) ** 2) * self.weights)
The k-nearest neighbors (kNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. A common rule of thumb is that a good K value is around the square root of N, where N is the total number of training samples; use an error plot or accuracy plot over a range of candidate values to find the most favorable K. kNN handles multi-class problems well, but you must be aware of outliers.
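For example, the accuracy-plot approach could look something like the following. This is only a minimal sketch, assuming scikit-learn and matplotlib are available; you could substitute your own weighted classifier for KNeighborsClassifier.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumes X_train and y_train already exist.
ks = list(range(1, 40, 2))  # odd values help avoid ties in binary problems
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                          X_train, y_train, cv=5).mean()
          for k in ks]

plt.plot(ks, scores, marker='o')
plt.xlabel('K')
plt.ylabel('cross-validated accuracy')
plt.show()

best_k = ks[int(np.argmax(scores))]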
kNN is just a slow algorithm: every prediction requires computing the distance from the query point to every training point, which is expensive at scale, and once the problem is large enough your CPU cache can't really be used effectively. At a minimum, use an implementation that exploits all of your cores.
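Before reaching for another library, note that the per-sample Python generators in the posted class can be replaced by one broadcasted NumPy computation over all query/training pairs. A rough sketch of a fully vectorized predict using the same weighted squared Euclidean distance (the function name and signature here are illustrative, not part of the original class):

import numpy as np

def predict_vectorized(X_train, y_train, X_test, k, weights):
    # Broadcast (n_test, 1, d) against (1, n_train, d) to get every
    # pairwise difference at once, then apply the feature weights.
    diffs = X_test[:, None, :] - X_train[None, :, :]
    dists = np.einsum('ijk,k->ij', diffs ** 2, weights)  # weighted squared L2
    # Indices of the k nearest training points for each query point.
    nearest = np.argpartition(dists, k - 1, axis=1)[:, :k]
    # Majority vote (assumes non-negative integer class labels).
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

The (n_test, n_train, d) intermediate array can be memory-hungry, so for large datasets you may want to process the query points in chunks; the libraries discussed below avoid that problem entirely.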
You can take a look at this great article introducing faiss: "Make kNN 300 times faster than Scikit-learn's in 20 lines!" It can run on the GPU and is developed in C++ behind the scenes.
import numpy as np
import faiss

class FaissKNeighbors:
    def __init__(self, k=5):
        self.index = None
        self.y = None
        self.k = k

    def fit(self, X, y):
        self.index = faiss.IndexFlatL2(X.shape[1])
        self.index.add(X.astype(np.float32))
        self.y = y

    def predict(self, X):
        distances, indices = self.index.search(X.astype(np.float32), k=self.k)
        votes = self.y[indices]
        predictions = np.array([np.argmax(np.bincount(x)) for x in votes])
        return predictions
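One caveat: IndexFlatL2 computes a plain (unweighted) L2 distance. If you still want the per-feature weights from the question, a simple workaround (my suggestion, not something from the article) is to pre-scale every feature by the square root of its weight before indexing, since sum(w_i * (x_i - y_i)^2) equals the ordinary squared L2 distance between the scaled vectors:

import numpy as np

scale = np.sqrt(weights).astype(np.float32)  # weights as defined in the question

model = FaissKNeighbors(k=5)
model.fit(X_train.astype(np.float32) * scale, y_train)
y_pred = model.predict(X_test.astype(np.float32) * scale)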
Scikit-learn uses a KD Tree or Ball Tree to compute nearest neighbors in O[N log(N)] time. Your algorithm is a direct approach that requires O[N^2] time, and it also uses nested for-loops within Python generator expressions, which add significant computational overhead compared to optimized code.
If you'd like to compute weighted k-neighbors classification using a fast O[N log(N)] implementation, you can use sklearn.neighbors.KNeighborsClassifier with the weighted minkowski metric, setting p=2 (for euclidean distance) and setting w to your desired weights. For example:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(metric='wminkowski', p=2,
                             metric_params=dict(w=weights))
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
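One small usage note: n_neighbors defaults to 5, so pass your own k explicitly, for example:

# Same model as above, but with the k from the question passed explicitly.
model = KNeighborsClassifier(n_neighbors=k, metric='wminkowski', p=2,
                             metric_params=dict(w=weights))

Also be aware that on newer scikit-learn releases the 'wminkowski' metric string has been deprecated; in that case the sqrt-of-weights pre-scaling shown above, combined with a plain euclidean metric, is a version-independent alternative.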