I read through some of the posts about determining whether a set A
is a subset of another set B
. But I find it hard to determine what algorithm to use. Here is an outline of the problem:
A
which I receive at the beginning of my program. Not much is known about the structure. Each string in the array can be arbitrarily long and the number of entries is not limited. Though usually it can be assumed that the number of entries in the array won't be overly large (< 100).n
.n
objects will also have an array of strings B
, i.e. there will be n
B
arrays. Once the program runs the B
s will be fixed, i.e. they do not change during runtime.A
is a subset of B
.Now, I thought about hashtables. However, they would, in my opinion only be efficient if the was only one B
and a lot of A
s. Then I could make a hashtable for B
and check each string array of each object against my hashtable. But this is not the case because there is only one A
but n
B
s. What would be an efficient algorithm to do this?
Example:
A: ["A", "G", "T"]
B1: ["C", "G"]
B2: ["K", "A", "U", "T", "G"]
.
.
.
Bn: ["T", "I", "G", "O", "L"]
Here A
is a subset of B2
but not of B1
, and not of Bn
.
An efficient approach is to represent the set A as a trie. This allows to check if a given string belongs to A in time linear in the string length.
Then there is no better way then checking exhaustively for all Bi and all strings in Bi if it belongs to A. The search stops as soon as all strings in A have been matched (flagging a string when it has been found).
The running time will be at worse proportional to the total number of characters in all strings in all B. In practice, a significant fraction of the characters will be skipped, as
the search for a string not in A can terminate early,
the subset test can conclude positively even if there are strings left in Bi,
the subset test can conclude negatively when there are more unmatched strings in A than strings left in Bi.
This approach is certainly worst-case optimal as you read the characters at most once and perform a constant number of operations per character.
As a first approach I would pre-calculate some general properties of sets, which (hopefully) will allow you to filter some of B
s quickly. These may be, for example:
A
certainly can't be a subset of B
if it contains more elements than B
;A
certainly is not a subset of B
if the longest string in A
is longer than the longest string in B
;For easier checking you might require every set to be ordered alphabetically. That will allow checking A
against a single B
in a (linear) scan through both sets of strings.
For small A
and big B
sets it might be more efficient to look for a string in B
with binary search rather than with a linear scan; that would require pre-sorting of B
, too.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With