How to assign many subsets to their largest supersets?

Tags:

algorithm

My data has a large number of sets (a few million). Each set contains between a few members and several tens of thousands of integers. Many of those sets are subsets of larger sets (there are many such supersets). I'm trying to assign each subset to its largest superset.

Can anyone recommend an algorithm for this type of task? There are many algorithms for generating all possible subsets of a set, but that kind of approach is time-prohibitive given my data size (e.g. this paper or SO question).

Example of my data-set:

A {1, 2, 3}
B {1, 3}
C {2, 4}
D {2, 4, 9}
E {3, 5}
F {1, 2, 3, 7}

Expected answer: B and A are subsets of F (it's not important that B is also a subset of A); C is a subset of D; E remains unassigned.

Dan asked Oct 30 '22 07:10

1 Answer

Here's an idea that might work:

  • Build a table that maps each number to a sorted list of sets, ordered by size with the largest first, and with ties broken in some canonical order (say, alphabetically by set name). So in your example, you'd have a table that maps 1 to [F, A, B], 2 to [F, A, D, C], 3 to [F, A, B, E] and so on. This can be implemented to take O(n log n) time where n is the total size of the input.
  • For each set in the input:
    • fetch the lists associated with each element in that set. So for A, you'd get the lists associated with 1, 2, and 3. The total number of lookups issued over the runtime of the whole algorithm is O(n), so the runtime so far is O(n log n + n), which is still O(n log n).
    • Now walk down each list simultaneously. If the same set is the first entry in every list, then it's the largest set that contains the input set. Output that association and continue with the next input set. If not, then discard the smallest item among the current list heads (under the largest-first ordering, that's the entry for the largest set) and try again. Implementing this last bit is tricky, but you can store the heads of all lists in a heap and get (IIRC) something like O(n log k) overall runtime where k is the maximum size of any individual set, so you can bound that at O(n log n) in the worst case.
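The first step above can be sketched concretely on the example data. This is a hypothetical illustration (the names `sets` and `index` are mine, not from the answer) of building the element-to-sets table, sorted largest set first with ties broken alphabetically:

```python
from collections import defaultdict

# The example sets from the question
sets = {'A': {1, 2, 3}, 'B': {1, 3}, 'C': {2, 4},
        'D': {2, 4, 9}, 'E': {3, 5}, 'F': {1, 2, 3, 7}}

# Map each element to the names of the sets containing it
index = defaultdict(list)
for name, s in sets.items():
    for element in s:
        index[element].append(name)

# Sort each list: largest set first, ties broken alphabetically by name
for names in index.values():
    names.sort(key=lambda n: (-len(sets[n]), n))

print(index[1])  # ['F', 'A', 'B']
print(index[2])  # ['F', 'A', 'D', 'C']
print(index[3])  # ['F', 'A', 'B', 'E']
```

This reproduces exactly the table described above: 1 maps to [F, A, B], 2 to [F, A, D, C], and 3 to [F, A, B, E].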

So if I got everything straight, the runtime of the algorithm is overall O(n log n), which seems like probably as good as you're going to get for this problem.

Here is a Python implementation of the algorithm:

from collections import defaultdict, deque
import heapq

def LargestSupersets(setlists):
  '''Computes, for each item in the input, the largest superset in the same input.

  setlists: A list of lists, each of which represents a set of items. Items must be
    hashable, and each inner list is assumed to be non-empty.
  '''
  # First, build a table that maps each element in any input setlist to a list of records
  # of the form (-size of setlist, index of setlist), one for each setlist that contains
  # the corresponding element
  element_to_entries = defaultdict(list)
  for idx, setlist in enumerate(setlists):
    entry = (-len(setlist), idx)  # cheesy way to make an entry that sorts properly -- largest first
    for element in setlist:
      element_to_entries[element].append(entry)

  # Within each entry, sort so that larger items come first, with ties broken arbitrarily by
  # the set's index
  for entries in element_to_entries.values():
    entries.sort()

  # Now build up the output by going over each setlist and walking over the entries list for
  # each element in the setlist. Since the entries list for each element is sorted largest to
  # smallest, the first entry we find that is in every entry set we pulled will be the largest
  # element of the input that contains each item in this setlist. We are guaranteed to eventually
  # find such an element because, at the very least, the item we're iterating on itself is in
  # each entries list.
  output = []
  for idx, setlist in enumerate(setlists):
    num_elements = len(setlist)
    buckets = [element_to_entries[element] for element in setlist]

    # We implement the search for an item that appears in every list by maintaining a heap and
    # a queue. We have the invariants that:
    #   1. The queue contains the n smallest items across all the buckets, in order
    #   2. The heap contains the smallest item from each bucket that has not already passed through
    #        the queue.
    smallest_entries_heap = []
    smallest_entries_deque = deque([], num_elements)
    for bucket_idx, bucket in enumerate(buckets):
      smallest_entries_heap.append((bucket[0], bucket_idx, 0))
    heapq.heapify(smallest_entries_heap)

    while (len(smallest_entries_deque) < num_elements or
           smallest_entries_deque[0] != smallest_entries_deque[num_elements - 1]):
      # First extract the next smallest entry in the queue ...
      (smallest_entry, bucket_idx, element_within_bucket_idx) = heapq.heappop(smallest_entries_heap)
      smallest_entries_deque.append(smallest_entry)

      # ... then add the next-smallest item from the bucket that we just removed an element from
      if element_within_bucket_idx + 1 < len(buckets[bucket_idx]):
        new_element = buckets[bucket_idx][element_within_bucket_idx + 1]
        heapq.heappush(smallest_entries_heap, (new_element, bucket_idx, element_within_bucket_idx + 1))

    output.append((idx, smallest_entries_deque[0][1]))

  return output

Note: don't trust my writeup too much here. I just thought of this algorithm right now; I haven't proved it correct or anything.
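Given that caveat, one way to sanity-check the heap-based version on small inputs is a brute-force baseline. This is my own hypothetical helper (`largest_supersets_naive` is not part of the answer's code): it simply scans all sets for each set and keeps the largest superset, breaking ties by index to match the `(-size, index)` ordering used above.

```python
# Naive O(n^2) reference implementation for validating results on small inputs.
# Not part of the original answer -- purely a sanity check.
def largest_supersets_naive(setlists):
    sets = [frozenset(s) for s in setlists]
    output = []
    for i, s in enumerate(sets):
        # Every set that contains s is a candidate; s itself always qualifies
        candidates = [j for j in range(len(sets)) if s <= sets[j]]
        # Largest candidate wins, ties broken by smallest index, matching the
        # (-size, index) sort order used in the heap-based implementation
        best = min(candidates, key=lambda j: (-len(sets[j]), j))
        output.append((i, best))
    return output

# The question's example: A..F at indices 0..5
data = [[1, 2, 3], [1, 3], [2, 4], [2, 4, 9], [3, 5], [1, 2, 3, 7]]
print(largest_supersets_naive(data))
# [(0, 5), (1, 5), (2, 3), (3, 3), (4, 4), (5, 5)]
```

That is, A and B map to F, C maps to D, and the maximal sets D, E, and F map to themselves; the expected answer from the question, with "unassigned" sets reported as their own supersets.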

jacobm answered Jan 02 '23 21:01