How to find the nearest neighbors of 1 Billion records with Spark?

Given 1 Billion records containing the following information:

    ID  x1    x2    x3   ... x100
    1   0.1   0.12  1.3  ... -2.00
    2   -1    1.2   2    ... 3
    ...

For each ID above, I want to find the top 10 closest IDs, based on the Euclidean distance between their vectors (x1, x2, ..., x100).

What's the best way to compute this?

asked May 03 '16 by Osiris


People also ask

How do I find the nearest neighbor distance?

The average nearest neighbor ratio is calculated as the observed average distance divided by the expected average distance (with expected average distance being based on a hypothetical random distribution with the same number of features covering the same total area).

How many neighbors can you have on KNN?

In KNN, K is the number of nearest neighbors, and it is the core deciding factor. K is generally chosen to be an odd number when there are 2 classes. When K=1, the algorithm is known as the nearest-neighbor algorithm.

What is KNN fit?

When a prediction is made, KNN compares the input with the training data it has stored. The class label of the data point with maximum similarity to the queried input is returned as the prediction. Hence, when we fit a KNN model, it simply learns (stores) the dataset in memory.


2 Answers

Performing a brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of the k-Nearest Neighbors algorithm, such as the one provided by scikit-learn, then broadcast the resulting arrays of indices and distances and go from there.

Steps in this case would be:

1- Vectorize the features as Bryce suggested, and have your vectorizing method return a list (or NumPy array) of floats with as many elements as there are features.
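A minimal sketch of this step, assuming the records live in a PySpark DataFrame named df with columns ID and x1 ... x100 (the DataFrame and the column handling are my assumptions, not part of the original answer):

    import numpy as np

    # Assumption: df is a PySpark DataFrame with columns ID, x1, ..., x100.
    feature_cols = ["x{}".format(i) for i in range(1, 101)]

    # Collect IDs and features together so row order stays consistent;
    # this only works if the data fits in driver memory (see the caveat below).
    rows = df.select("ID", *feature_cols).collect()
    ids = [row["ID"] for row in rows]
    vectorized_data = np.array([[float(row[c]) for c in feature_cols]
                                for row in rows])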

2- Fit scikit-learn's NearestNeighbors to your data:

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)

3- Run the trained model on your vectorized data (training and query data are the same in your case):

distances, indices = nbrs.kneighbors(vectorized_data)

Note that when you query with the training data itself, each point is returned as its own nearest neighbor (at distance 0), so fit with n_neighbors=11 and drop the first column if you want the top 10 excluding the point itself.

Steps 2 and 3 will run on your PySpark driver node and are not parallelizable in this case, so you will need enough memory on that node. In my case, with 1.5 million records and 4 features, it took a second or two.
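To then "broadcast the resulting arrays of indices and distances and go further", one possible continuation (the variable names are illustrative; sc is the SparkContext, and ids, distances, indices come from the snippets above):

    # Broadcast the results once so executors can look up any record's
    # neighbors without shipping the full arrays with every task.
    id_arr = np.array(ids)
    bc_ids = sc.broadcast(id_arr)
    bc_indices = sc.broadcast(indices)
    bc_distances = sc.broadcast(distances)

    def neighbors_of(row_position):
        # Map a row position back to (ID, [(neighbor ID, distance), ...]).
        nbr_positions = bc_indices.value[row_position]
        nbr_ids = bc_ids.value[nbr_positions].tolist()
        nbr_dists = bc_distances.value[row_position].tolist()
        return (bc_ids.value[row_position], list(zip(nbr_ids, nbr_dists)))

    # Example: an RDD of (ID, top-k neighbors) pairs.
    neighbor_rdd = sc.parallelize(range(len(id_arr))).map(neighbors_of)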

Until we get a good NN implementation for Spark, I guess we will have to stick to these workarounds. If you would rather try something new, then go for http://spark-packages.org/package/saurfang/spark-knn

answered Sep 20 '22 by architectonic


As it happens, I have a solution to this that combines sklearn with Spark: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/

The gist of it is:

  • Use sklearn’s k-NN fit() method centrally
  • But then use sklearn’s k-NN kneighbors() method in a distributed fashion (see the sketch below)
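A minimal sketch of that pattern, under my own assumptions (data_rdd as an RDD of (ID, feature_list) pairs, vectorized_data as the centrally collected training matrix); the linked post has the full worked version:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Fit centrally on the driver, as in the first answer.
    # Assumption: vectorized_data is the full (or sampled) training matrix.
    model = NearestNeighbors(n_neighbors=10).fit(vectorized_data)

    # Broadcast the fitted model once; every executor reuses it.
    bc_model = sc.broadcast(model)

    def query_partition(rows):
        # Assumption: each element of `rows` is an (ID, feature_list) pair.
        rows = list(rows)
        if not rows:
            return
        ids = [r[0] for r in rows]
        X = np.array([r[1] for r in rows], dtype=float)
        distances, indices = bc_model.value.kneighbors(X)
        for i, record_id in enumerate(ids):
            yield (record_id, list(zip(indices[i].tolist(),
                                       distances[i].tolist())))

    # Assumption: data_rdd is an RDD of (ID, feature_list) pairs.
    results = data_rdd.mapPartitions(query_partition)

Broadcasting the fitted model keeps the memory-heavy fit on the driver while spreading the kneighbors() queries across the cluster, which is where most of the work is with 1 Billion query points.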
answered Sep 20 '22 by xenocyon