I have a dataset of (sensor_id, timestamp, data) rows, where sensor_id identifies an IoT device, timestamp is UNIX time, and data is an MD5 hash of the device's output at that time. The table has no primary key, but each row is unique.

I need to find all pairs of sensors s1 and s2 such that the two sensors have at least n (n=50) entries (timestamp, data) in common, i.e. on at least n different occasions they emitted the same data at the same timestamp.

For a sense of the magnitudes involved: I have 10B rows and ~50M distinct sensor_ids, and I believe there are around ~5M pairs of sensor ids that emitted the same data at the same timestamp at least 50 times.

What's the best way to do this in Spark? I tried various approaches (grouping by (timestamp, data) and/or self-joining), but they are prohibitively expensive.
This is pseudo-code, abstracting from Spark. You can sort your dataset first:
select sensor_id, timestamp, data order by timestamp, data, sensor_id
Ten example rows, in sorted order:
s1,100,a #1
s2,100,a #2
s3,100,a #3
s4,100,b #4
s1,101,a #5
s2,101,a #6
s3,101,b #7
s4,101,b #8
s3,102,b #9
s4,102,b #10
Now iterate from top to bottom and, as long as the timestamp and data match the previous entry, collect the entries into a group.
In our example, rows 1-3 form such a group, so we already see some potential pairs:
s1, s2
s1, s3
s2, s3
Row #4 is the only entry with (100,b), so we can skip it. Rows #5 and #6 share (101,a) and give another pair:
s1, s2
Rows #7 and #8 share (101,b):
s3, s4
Rows #9 and #10 also form a pair:
s3, s4
Putting it all together, one can easily count the pairs:
s1, s2
s1, s2
s1, s3
s2, s3
s3, s4
s3, s4
Keeping only the pairs that occur at least n times gives the final answer.
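To make the walkthrough concrete, here is a minimal single-machine sketch in plain Python (not the asker's Spark setting); the `rows` literal and the `count_pairs` name are illustrative, not from the original:

```python
from collections import Counter
from itertools import combinations, groupby

# Toy rows (sensor_id, timestamp, data) matching the example above.
rows = [
    ("s1", 100, "a"), ("s2", 100, "a"), ("s3", 100, "a"), ("s4", 100, "b"),
    ("s1", 101, "a"), ("s2", 101, "a"), ("s3", 101, "b"), ("s4", 101, "b"),
    ("s3", 102, "b"), ("s4", 102, "b"),
]

def count_pairs(rows):
    """Count co-occurring sensor pairs in one pass over sorted rows."""
    # Sort by (timestamp, data, sensor_id) so equal keys become adjacent.
    rows = sorted(rows, key=lambda r: (r[1], r[2], r[0]))
    counts = Counter()
    # Group consecutive rows that share the same (timestamp, data) key.
    for _, group in groupby(rows, key=lambda r: (r[1], r[2])):
        sensors = sorted({r[0] for r in group})
        # Every unordered pair of sensors in the group co-occurred once.
        for pair in combinations(sensors, 2):
            counts[pair] += 1
    return counts

counts = count_pairs(rows)
# counts[("s1", "s2")] == 2 and counts[("s3", "s4")] == 2, as in the walkthrough
```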
The benefit of this method is that if you can sort the file, you can split the sorted dataset into multiple smaller chunks (splitting only on group boundaries, i.e. rows #1-3 must stay in one chunk), calculate the pairs per chunk, and merge the results as a final step.
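The chunk-and-merge step can be sketched the same way; `split_on_groups`, `count_chunk`, and the `target_size` of 4 are illustrative assumptions, and the threshold is lowered to n=2 so the toy data produces hits:

```python
from collections import Counter
from itertools import combinations, groupby

def count_chunk(chunk):
    """Count co-occurring sensor pairs within one chunk of sorted rows."""
    counts = Counter()
    for _, group in groupby(chunk, key=lambda r: (r[1], r[2])):
        sensors = sorted({r[0] for r in group})
        for pair in combinations(sensors, 2):
            counts[pair] += 1
    return counts

def split_on_groups(sorted_rows, target_size):
    """Cut sorted rows into chunks, but only at (timestamp, data) boundaries."""
    chunks, current = [], []
    for row in sorted_rows:
        # Only cut once the chunk is full AND the key changes, so a
        # (timestamp, data) group is never torn across two chunks.
        if (len(current) >= target_size
                and (current[-1][1], current[-1][2]) != (row[1], row[2])):
            chunks.append(current)
            current = []
        current.append(row)
    if current:
        chunks.append(current)
    return chunks

rows = sorted([
    ("s1", 100, "a"), ("s2", 100, "a"), ("s3", 100, "a"), ("s4", 100, "b"),
    ("s1", 101, "a"), ("s2", 101, "a"), ("s3", 101, "b"), ("s4", 101, "b"),
    ("s3", 102, "b"), ("s4", 102, "b"),
], key=lambda r: (r[1], r[2], r[0]))

# Count each chunk independently (the parallelizable part), then
# merge the partial counters and apply the threshold n.
total = Counter()
for chunk in split_on_groups(rows, target_size=4):
    total.update(count_chunk(chunk))

n = 2  # n=50 in the real dataset; 2 here so the toy example has hits
frequent = {pair: c for pair, c in total.items() if c >= n}
```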
I hope this helps.