Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient algorithm to check for subset against a range of sets

I read through some of the posts about determining whether a set A is a subset of another set B. But I find it hard to determine what algorithm to use. Here is an outline of the problem:

  • I have an array of strings A which I receive at the beginning of my program. Not much is known about the structure. Each string in the array can be arbitrarily long and the number of entries is not limited. Though usually it can be assumed that the number of entries in the array won't be overly large (< 100).
  • Then I iterate through a list of objects of length n.
  • Each of the n objects will also have an array of strings B, i.e. there will be n B arrays. Once the program runs the Bs will be fixed, i.e. they do not change during runtime.
  • I want to determine for each object if A is a subset of B.

Now, I thought about hashtables. However, they would, in my opinion only be efficient if the was only one B and a lot of As. Then I could make a hashtable for B and check each string array of each object against my hashtable. But this is not the case because there is only one A but n Bs. What would be an efficient algorithm to do this?

Example:

A:  ["A", "G", "T"]
B1: ["C", "G"]
B2: ["K", "A", "U", "T", "G"]
.
.
.
Bn: ["T", "I", "G", "O", "L"]

Here A is a subset of B2 but not of B1, and not of Bn.

like image 262
lord.garbage Avatar asked Sep 26 '22 15:09

lord.garbage


2 Answers

An efficient approach is to represent the set A as a trie. This allows to check if a given string belongs to A in time linear in the string length.

Then there is no better way then checking exhaustively for all Bi and all strings in Bi if it belongs to A. The search stops as soon as all strings in A have been matched (flagging a string when it has been found).

The running time will be at worse proportional to the total number of characters in all strings in all B. In practice, a significant fraction of the characters will be skipped, as

  • the search for a string not in A can terminate early,

  • the subset test can conclude positively even if there are strings left in Bi,

  • the subset test can conclude negatively when there are more unmatched strings in A than strings left in Bi.

This approach is certainly worst-case optimal as you read the characters at most once and perform a constant number of operations per character.

like image 68
Yves Daoust Avatar answered Sep 29 '22 07:09

Yves Daoust


As a first approach I would pre-calculate some general properties of sets, which (hopefully) will allow you to filter some of Bs quickly. These may be, for example:

  • a number of strings – A certainly can't be a subset of B if it contains more elements than B;
  • a length of the longest string – A certainly is not a subset of B if the longest string in A is longer than the longest string in B;
  • a sum of strings' lengths.

For easier checking you might require every set to be ordered alphabetically. That will allow checking A against a single B in a (linear) scan through both sets of strings.

For small A and big B sets it might be more efficient to look for a string in B with binary search rather than with a linear scan; that would require pre-sorting of B, too.

like image 43
CiaPan Avatar answered Sep 29 '22 07:09

CiaPan