Given a set strings of containing numbers, how can I find those strings that are the superset. For example if strings '139 24' and '139 277 24' appear then I want to keep '139 277 24' as '139 24' can be found inside it. Also these numbers may appear in any order inside a string.
'24'
'277'
'277 24'
'139 24'
'139 277 24'
'139 277'
'139'
'136 24'
'136 277 24'
'136 277'
'136'
'136 139 24'
'136 139 277 24'
'136 139 277'
'136 139'
'246'
The result for the above data is given below.
'136 139 277 24'
'246'
Edit: I am splitting each string and putting individual numbers in a set and then comparing this through the sets created from the whole list. I can find a solution using this approach but I think there should be some other elegant way to perform the same.
I was trying the following code and felt that it is becoming unnecessarily complex.
#First create a set of tuples
allSeqsTuple = set()
for seq in allSeqs: #allSeqs store the sequences described above
x = seq.split()
allSeqsTuple.add(tuple(x))
#For each 'allSeqs', find if all the items in that seq that is in 'allSeqsTuple'.
for line in allSeqs:
x = set(line.split())
result = findContainment(x, allSeqsTuple)
......
......
def findContainment(x, allSeqsTuple):
contained = False
for y in allSeqsTuple:
cntnd = bool(x-set(y))
if (cntnd):
contained = True
continue
else:
break
return contained
Let's make a laundry list of the major players in this problem:
'24 139 277'
<=
set operatorset(['24', '139', '277'])
We are given a list of strings, but what we'd really like -- what would be more useful -- is a list of sets:
In [20]: strings = [frozenset(s.split()) for s in strings]
In [21]: strings
Out[21]:
[frozenset(['24']),
frozenset(['277']),
...
frozenset(['136', '139']),
frozenset(['246'])]
The reason for frozensets will become apparent shortly. I'll explain why, below. The reason why we want sets at all is because that have a convenient superset comparison operator:
In [22]: frozenset(['136']) <= frozenset(['136', '139', '24'])
Out[22]: True
In [23]: frozenset(['136']) <= frozenset(['24', '277'])
Out[23]: False
This is exactly what we need to determine if one string is a superstring of another.
So, basically, we want to:
superstrings = set()
for s in strings
.s
in strings
, we will add new ones to
superstrings
if they are not a subset of a item already in
superstrings
.For each s
, iterate through a set of superstrings
: for sup in superstrings
.
Check if s <= sup
-- that is, if s
is a subset of sup
, quit the loop since s
is smaller than some known superstring.
Check if sup <= s
-- that is, if
s
a superset of some item in superstrings
. In this case, remove the item in superstrings
and replace it with s
.
Technical notes:
Because we are removing items from superstrings
, we can not also
iterate over superstrings
itself. So, instead, iterate over a copy:
for sup in superstrings.copy():
superstrings
to be a set of sets. But
the items in a set have to be hashable, and sets themselves are not
hashable. But frozensets are, so it is possible to have a set of
frozensets. This is why we converted strings
into a list of
frozensets
.strings = [
'24', '277', '277 24', '139 24', '139 277 24', '139 277', '139', '136 24',
'136 277 24', '136 277', '136', '136 139 24', '136 139 277 24', '136 139 277',
'136 139', '246']
def find_supersets(strings):
superstrings = set()
set_to_string = dict(zip([frozenset(s.split()) for s in strings], strings))
for s in set_to_string.keys():
for sup in superstrings.copy():
if s <= sup:
# print('{s!r} <= {sup!r}'.format(s = s, sup = sup))
break
elif sup < s:
# print('{sup!r} <= {s!r}'.format(s = s, sup = sup))
superstrings.remove(sup)
else:
superstrings.add(s)
return [set_to_string[sup] for sup in superstrings]
print(find_supersets(strings))
yields
['136 139 277 24', '246']
It turns out this is faster than pre-sorting the strings:
def using_sorted(strings):
stsets = sorted(
(frozenset(s.split()) for s in strings), key=len, reverse=True)
superstrings = set()
for stset in stsets:
if not any(stset.issubset(s) for s in superstrings):
superstrings.add(stset)
return superstrings
In [29]: timeit find_supersets(strings)
100000 loops, best of 3: 18.3 us per loop
In [25]: timeit using_sorted(strings)
10000 loops, best of 3: 24.9 us per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With