Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pyspark euclidean distance between entry and column

I am working with pyspark, and wondering if there is any smart way to get euclidean dstance between one row entry of array and the whole column. For instance, there is a dataset like this.

+--------------------+---+
|            features| id|
+--------------------+---+
|[0,1,2,3,4,5     ...|  0|
|[0,1,2,3,4,5     ...|  1|
|[1,2,3,6,7,8     ...|  2|

Choose one of the column i.e. id==1, and calculate the euclidean distance. In this case, the result should be [0,0,sqrt(1+1+1+9+9+9)]. Can anybody figure out how to do this efficiently? Thanks!

like image 827
Yong Hyun Kwon Avatar asked Oct 13 '17 08:10

Yong Hyun Kwon


1 Answers

You can do BucketedRandomProjectionLSH [1] to get a cartesian of distances between your data frame.

from pyspark.ml.feature import BucketedRandomProjectionLSH

brp = BucketedRandomProjectionLSH(
    inputCol="features", outputCol="hashes", seed=12345, bucketLength=1.0
)
model = brp.fit(df)
model.approxSimilarityJoin(df, df, 3.0, distCol="EuclideanDistance")

You can also get distances for one row to column with approxNearestNeighbors [2], but the results are limited by numNearestNeighbors, so you could give it the count of the entire data frame.

one_row = df.where(df.id == 1).first().features
model.approxNearestNeighbors(df2, one_row, df.count()).collect()

Also, make sure to convert your data to Vectors!

from pyspark.sql import functions as F

to_dense_vector = F.udf(Vectors.dense, VectorUDF())
df = df.withColumn('features', to_dense_vector('features'))

[1] https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html?highlight=approx#pyspark.ml.feature.BucketedRandomProjectionLSH

[2] https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html?highlight=approx#pyspark.ml.feature.BucketedRandomProjectionLSHModel.approxNearestNeighbors

like image 165
munro Avatar answered Sep 22 '22 07:09

munro