Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark: Find pairs having at least n common attributes?

I have a dataset consisting of (sensor_id, timestamp, data) (the sensor_id are ids of IoT devices, timestamp is UNIX time and data is an MD5 hash of their output at that time). There is no primary key on the table but each row is unique.

I need to find all pairs of sensor_ids s1 and s2 such that these two sensors have at least n (n=50) entries (timestamp, data) in common between them i.e. on n different occasions they emitted same data at same timestamp.

For a sense of magnitudes of the data, I have 10B rows and ~50M distinct sensor_ids and I believe that there are around ~5M pairs of sensor-ids that emitted same data at same timestamp at least 50 times.

What's the best way to do this in Spark? I tried various approaches (group-by (timestamp, data) and/or self-joining) but they are prohibitively expensive in complexity.

like image 443
pathikrit Avatar asked Mar 11 '17 22:03

pathikrit


1 Answers

This is a pseudo-code, abstracting from Spark. You can sort your dataset first:

select id, timestamp, data order by timestamp, data, id

Exemplary 10 rows:

s1,100,a  #1
s2,100,a  #2
s3,100,a  #3
s4,100,b  #4
s1,101,a  #5
s3,101,b  #6
s4,101,b  #7
s2,101,a  #8
s3,102,b  #9
s4,102,b  #10

Now iterate top to bottom, and as long as the timestamp and data is the same as previous entry build a list of entries.

In our example rows 1-3 form such a list, so we see some potential pairs already:

s1, s2
s1, s3
s2, s3

Row #4 is just a single entry with (100,b), we can skip it. Row #5 only one entry with (101,a), we can skip it.

Row #6 and #7 are new pair:

s3, s4

Also #9 and #10 form a pair

Putting it all together one can easily count pairs:

s1, s2
s1, s3
s2, s3
s3, s4
s3, s4

The benefit of this method is that if you can sort the file, you can split the sorted dataset into multiple smaller chunks (the chunks should be splitted on the group boundaries - i.e. #1,2,3 should be in one chunk) , calculate the pairs, and join the final results as a last step.

I hope this helps.

like image 145
Piotr Reszke Avatar answered Sep 21 '22 09:09

Piotr Reszke