 

Are there alternative solution without cross-join in Spark 2?


I wonder whether there is an elegant way to solve the following situation in Spark 2.0. The situation is like this.

Dataset1 (TargetData) has this schema and has about 20 million records.

  • id (String)
  • embedding vector (Array, 300 dimensions)

Dataset2 (DictionaryData) has this schema and has about 9,000 records.

  • dict key (String)
  • embedding vector (Array, 300 dimensions)

For each vector in Dataset1, I want to find the dict key in Dataset2 whose vector gives the maximum cosine similarity with it.

Initially, I tried cross-joining Dataset1 and Dataset2 and calculating the cosine similarity for all record pairs, but the amount of data is too large for my environment.

I have not tried it yet, but I thought of collecting Dataset2 as a list and then applying a udf.

Is there any other method for this situation? Thanks.
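(For reference, here is a minimal numpy sketch of the per-row logic such a udf would implement, assuming the 9,000 dictionary vectors have been collected into one local matrix; the names `best_dict_key`, `dict_keys`, and `dict_matrix` are illustrative, not from any Spark API.)

```python
import numpy as np

def best_dict_key(target_vec, dict_keys, dict_matrix):
    """Return the dict key whose vector is most cosine-similar
    to target_vec.

    dict_keys:   list of N key strings
    dict_matrix: N x d numpy array of dictionary vectors
    """
    target = np.asarray(target_vec, dtype=float)
    # Cosine similarity = dot product after L2 normalization.
    target = target / np.linalg.norm(target)
    norms = np.linalg.norm(dict_matrix, axis=1)
    sims = dict_matrix @ target / norms
    return dict_keys[int(np.argmax(sims))]
```

Inside a Spark udf, `dict_keys` and `dict_matrix` would come from the collected (or broadcast) Dataset2, and the udf would be applied to each row of Dataset1.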

Ashe asked Dec 19 '25 09:12


1 Answer

There might be two options. The first is to broadcast Dataset2: since you need to scan it for every row of Dataset1, broadcasting keeps a copy on each node and avoids the network delay of fetching it from a different node. In that case, first check that your cluster can handle the memory cost of 9,000 rows × 300 columns (not too big, in my opinion). You still need the join, although with broadcasting it should be faster. The other option is to populate a RowMatrix from your existing vectors and let Spark do the calculations for you.
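As a sketch of the matrix route (written with plain numpy here to show the math, not with Spark's RowMatrix API): L2-normalize the rows of both matrices, multiply target by the transposed dictionary matrix to get all cosine similarities at once, then take the per-row argmax. The function name `best_keys_matrix` is illustrative.

```python
import numpy as np

def best_keys_matrix(target_matrix, dict_keys, dict_matrix):
    """For every row of target_matrix (M x d), return the dict key
    whose row in dict_matrix (N x d) is most cosine-similar.

    The M x N similarity matrix is what a distributed setup (e.g.
    RowMatrix times a local matrix) would compute in blocks instead
    of all at once.
    """
    t = target_matrix / np.linalg.norm(target_matrix, axis=1, keepdims=True)
    d = dict_matrix / np.linalg.norm(dict_matrix, axis=1, keepdims=True)
    sims = t @ d.T  # cosine similarities, shape M x N
    return [dict_keys[i] for i in np.argmax(sims, axis=1)]
```

With M ≈ 20 million and N ≈ 9,000, the similarity matrix never needs to be materialized whole; each block of target rows can be reduced to its argmax independently, which is why the broadcast-plus-udf and the RowMatrix approaches both scale.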

abiratsis answered Dec 20 '25 23:12