Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

algorithm: is there a map-reduce way to merge a group of sets by deleting all the subsets

The problem is: Suppose we have a group of Sets: Set(1,2,3) Set(1,2,3,4) Set(4,5,6) Set(1,2,3,4,6), we need to delete all the subsets and finally get the Result: Set(4,5,6) Set(1,2,3,4,6). (Since both Set(1,2,3) and Set(1,2,3,4) are the subsets of Set(1,2,3,4,6), both are removed.)

And suppose that the elements of the set have order, which can be Int, Char, etc.

Is it possible to do it in a map-reduce way?

The reason to do it in a map-reduce way is that sometimes the group of Sets has a very large size, which makes it not possible to do it in the memory of a single machine. So we hope to do it in a map-reduce way, it may be not very efficient, but just work.

My problem is:

  1. I don't know how to define a key for the key-value pair in the map-reduce process to group Sets properly.

  2. I don't know when the process should be finished, that all the subsets have been removed.

EDIT:

  1. The size of the data will keep growing larger in the future.

  2. The input can be either a group of sets or multiple lines with each line containing a group of sets. Currently the input is val data = RDD[Set], I firstly do data.collect(), which results in an overall group of sets. But I can modify the generation of the input into a RDD[Array[Set]], which will give me multiple lines with each line containing a group of sets.

  3. The elements in each set can be sorted by modifying other parts of the program.

like image 826
宇宙人 Avatar asked Dec 24 '15 02:12

宇宙人


1 Answers

I doubt this can be done by a traditional map-reduce technique which is essentially a divide-and-conquer method. This is because:

  • in this problem each set has to essentially be compared to all of the sets of larger cardinality whose min and max elements lie around the min and max of the smaller set.
  • unlike sorting and other problems amenable to map-reduce, we don't have a transitivity relation, i.e., if A is not-a-subset-of B and B is-not-subset-of C, we cannot make any statement about A w.r.t. C.

Based on the above observations this problem seems to be similar to duplicate detection and there is research on duplicate detection, for example here. Similar techniques will work well for the current problem.

like image 51
user1952500 Avatar answered Oct 19 '22 05:10

user1952500