How to find the nearest neighbors of 1 Billion records with Spark?

Given 1 Billion records containing the following information:

    ID  x1    x2    x3   ... x100
    1   0.1   0.12  1.3  ... -2.00
    2   -1    1.2   2    ... 3
    ...

For each ID above, I want to find the top 10 closest IDs, based on the Euclidean distance between their vectors (x1, x2, ..., x100).

What's the best way to compute this?

asked May 03 '16 by Osiris


People also ask

How do I find the nearest neighbor distance?

The average nearest neighbor ratio is calculated as the observed average distance divided by the expected average distance (with expected average distance being based on a hypothetical random distribution with the same number of features covering the same total area).

How many neighbors can you have on KNN?

In KNN, K is the number of nearest neighbors, and it is the core deciding factor. K is generally chosen to be an odd number when there are 2 classes. When K=1, the algorithm is known as the nearest-neighbor algorithm.

What is KNN fit?

When a prediction is made, KNN compares the input with the training data it has stored. The class label of the data point with maximum similarity to the queried input is returned as the prediction. Hence, when we fit a KNN model, it simply learns (stores) the dataset in memory.


2 Answers

Performing a brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of the k-Nearest Neighbors algorithm, such as the one provided by scikit-learn, then broadcast the resulting arrays of indices and distances and go from there.

Steps in this case would be:

1- Vectorize the features as Bryce suggested, and have your vectorizing method return a list (or NumPy array) of floats with as many elements as there are features.
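A minimal sketch of this step, assuming the records live in a PySpark DataFrame named df with columns ID and x1 ... x100 (the DataFrame and the column handling are my assumptions, not part of the original answer):

    import numpy as np

    # Assumption: df is a PySpark DataFrame with columns ID, x1, ..., x100.
    feature_cols = ["x{}".format(i) for i in range(1, 101)]

    # Collect IDs and features together so row order stays consistent;
    # this only works if the data fits in driver memory (see the caveat below).
    rows = df.select("ID", *feature_cols).collect()
    ids = [row["ID"] for row in rows]
    vectorized_data = np.array([[float(row[c]) for c in feature_cols]
                                for row in rows])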

2- Fit scikit-learn's NearestNeighbors to your data:

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)

3- Run the trained model on your vectorized data (training and query data are the same in your case):

distances, indices = nbrs.kneighbors(vectorized_data)

Note that when you query with the training data itself, each point is returned as its own nearest neighbor (at distance 0), so fit with n_neighbors=11 and drop the first column if you want the top 10 excluding the point itself.

Steps 2 and 3 will run on your PySpark driver node and are not parallelizable in this case, so you will need enough memory on that node. In my case, with 1.5 million records and 4 features, it took a second or two.
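To then "broadcast the resulting arrays of indices and distances and go further", one possible continuation (the variable names are illustrative; sc is the SparkContext, and ids, distances, indices come from the snippets above):

    # Broadcast the results once so executors can look up any record's
    # neighbors without shipping the full arrays with every task.
    id_arr = np.array(ids)
    bc_ids = sc.broadcast(id_arr)
    bc_indices = sc.broadcast(indices)
    bc_distances = sc.broadcast(distances)

    def neighbors_of(row_position):
        # Map a row position back to (ID, [(neighbor ID, distance), ...]).
        nbr_positions = bc_indices.value[row_position]
        nbr_ids = bc_ids.value[nbr_positions].tolist()
        nbr_dists = bc_distances.value[row_position].tolist()
        return (bc_ids.value[row_position], list(zip(nbr_ids, nbr_dists)))

    # Example: an RDD of (ID, top-k neighbors) pairs.
    neighbor_rdd = sc.parallelize(range(len(id_arr))).map(neighbors_of)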

Until we get a good NN implementation for Spark, I guess we will have to stick to these workarounds. If you would rather try something new, then go for http://spark-packages.org/package/saurfang/spark-knn

answered Sep 20 '22 by architectonic


As it happens, I have a solution to this that combines sklearn with Spark: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/

The gist of it is:

  • Use sklearn’s k-NN fit() method centrally
  • But then use sklearn’s k-NN kneighbors() method in a distributed fashion (see the sketch below)
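A minimal sketch of that pattern, under my own assumptions (data_rdd as an RDD of (ID, feature_list) pairs, vectorized_data as the centrally collected training matrix); the linked post has the full worked version:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Fit centrally on the driver, as in the first answer.
    # Assumption: vectorized_data is the full (or sampled) training matrix.
    model = NearestNeighbors(n_neighbors=10).fit(vectorized_data)

    # Broadcast the fitted model once; every executor reuses it.
    bc_model = sc.broadcast(model)

    def query_partition(rows):
        # Assumption: each element of `rows` is an (ID, feature_list) pair.
        rows = list(rows)
        if not rows:
            return
        ids = [r[0] for r in rows]
        X = np.array([r[1] for r in rows], dtype=float)
        distances, indices = bc_model.value.kneighbors(X)
        for i, record_id in enumerate(ids):
            yield (record_id, list(zip(indices[i].tolist(),
                                       distances[i].tolist())))

    # Assumption: data_rdd is an RDD of (ID, feature_list) pairs.
    results = data_rdd.mapPartitions(query_partition)

Broadcasting the fitted model keeps the memory-heavy fit on the driver while spreading the kneighbors() queries across the cluster, which is where most of the work is with 1 Billion query points.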
answered Sep 20 '22 by xenocyon