The problem is: Suppose we have a group of Sets: Set(1,2,3)
Set(1,2,3,4)
Set(4,5,6)
Set(1,2,3,4,6)
, we need to delete all the subsets and finally get the Result: Set(4,5,6)
Set(1,2,3,4,6)
. (Since both Set(1,2,3)
and Set(1,2,3,4)
are the subsets of Set(1,2,3,4,6)
, both are removed.)
And suppose that the elements of the set have order
, which can be Int, Char, etc.
Is it possible to do it in a map-reduce way?
The reason to do it in a map-reduce way is that sometimes the group of Sets has a very large size, which makes it not possible to do it in the memory of a single machine. So we hope to do it in a map-reduce way, it may be not very efficient, but just work.
My problem is:
I don't know how to define a key for the key-value pair in the map-reduce process to group Sets properly.
I don't know when the process should be finished, that all the subsets have been removed.
EDIT:
The size of the data will keep growing larger in the future.
The input can be either a group of sets or multiple lines with each line containing a group of sets. Currently the input is val data = RDD[Set]
, I firstly do data.collect()
, which results in an overall group of sets. But I can modify the generation of the input into a RDD[Array[Set]]
, which will give me multiple lines with each line containing a group of sets.
The elements in each set can be sorted by modifying other parts of the program.
I doubt this can be done by a traditional map-reduce technique which is essentially a divide-and-conquer method. This is because:
A is not-a-subset-of B
and B is-not-subset-of C
, we cannot make any statement about A
w.r.t. C
.Based on the above observations this problem seems to be similar to duplicate detection and there is research on duplicate detection, for example here. Similar techniques will work well for the current problem.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With