Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding superstrings in a set of strings in python

Tags:

python

string

Given a set strings of containing numbers, how can I find those strings that are the superset. For example if strings '139 24' and '139 277 24' appear then I want to keep '139 277 24' as '139 24' can be found inside it. Also these numbers may appear in any order inside a string.

               '24'
              '277'
           '277 24'
           '139 24'
       '139 277 24'
          '139 277'
              '139'
           '136 24'
       '136 277 24'
          '136 277'
              '136'
       '136 139 24'
   '136 139 277 24'
      '136 139 277'
          '136 139'
              '246'

The result for the above data is given below.

   '136 139 277 24'
              '246'

Edit: I am splitting each string and putting individual numbers in a set and then comparing this through the sets created from the whole list. I can find a solution using this approach but I think there should be some other elegant way to perform the same.

I was trying the following code and felt that it is becoming unnecessarily complex.

#First create a set of tuples
allSeqsTuple = set()
for seq in allSeqs: #allSeqs store the sequences described above
    x = seq.split()
    allSeqsTuple.add(tuple(x))


#For each 'allSeqs', find if all the items in that seq that is in 'allSeqsTuple'. 
for line in allSeqs:
    x = set(line.split())
    result = findContainment(x, allSeqsTuple)
    ......
    ......

def findContainment(x, allSeqsTuple):
    contained = False
    for y in allSeqsTuple:
        cntnd = bool(x-set(y))
        if (cntnd):
            contained = True
            continue
        else:
            break
    return contained
like image 436
user1140126 Avatar asked Jan 16 '13 16:01

user1140126


1 Answers

Let's make a laundry list of the major players in this problem:

  • strings, e.g. '24 139 277'
  • sets -- a collection of "superstrings"
  • superset inclusion -- the <= set operator
  • splitting the strings into a set of number-strings: e.g. set(['24', '139', '277'])

We are given a list of strings, but what we'd really like -- what would be more useful -- is a list of sets:

In [20]: strings = [frozenset(s.split()) for s in strings]    
In [21]: strings
Out[21]: 
[frozenset(['24']),
 frozenset(['277']),
 ...
 frozenset(['136', '139']),
 frozenset(['246'])]

The reason for frozensets will become apparent shortly. I'll explain why, below. The reason why we want sets at all is because that have a convenient superset comparison operator:

In [22]: frozenset(['136']) <= frozenset(['136', '139', '24'])
Out[22]: True

In [23]: frozenset(['136']) <= frozenset(['24', '277'])
Out[23]: False

This is exactly what we need to determine if one string is a superstring of another.

So, basically, we want to:

  • Start with an empty set of superstrings = set()
  • Iterate through strings: for s in strings.
  • As we examine each s in strings, we will add new ones to superstrings if they are not a subset of a item already in superstrings.
  • For each s, iterate through a set of superstrings: for sup in superstrings.

    • Check if s <= sup -- that is, if s is a subset of sup, quit the loop since s is smaller than some known superstring.

    • Check if sup <= s -- that is, if s a superset of some item in superstrings. In this case, remove the item in superstrings and replace it with s.

Technical notes:

  • Because we are removing items from superstrings, we can not also iterate over superstrings itself. So, instead, iterate over a copy:

    for sup in superstrings.copy():
    
  • And finally, we would like superstrings to be a set of sets. But the items in a set have to be hashable, and sets themselves are not hashable. But frozensets are, so it is possible to have a set of frozensets. This is why we converted strings into a list of frozensets.

strings = [
    '24', '277', '277 24', '139 24', '139 277 24', '139 277', '139', '136 24',
    '136 277 24', '136 277', '136', '136 139 24', '136 139 277 24', '136 139 277',
    '136 139', '246']

def find_supersets(strings):
    superstrings = set()
    set_to_string = dict(zip([frozenset(s.split()) for s in strings], strings))
    for s in set_to_string.keys():
        for sup in superstrings.copy():
            if s <= sup:
                # print('{s!r} <= {sup!r}'.format(s = s, sup = sup))
                break
            elif sup < s:
                # print('{sup!r} <= {s!r}'.format(s = s, sup = sup))
                superstrings.remove(sup)
        else:
            superstrings.add(s)
    return [set_to_string[sup] for sup in superstrings]

print(find_supersets(strings))

yields

['136 139 277 24', '246']

It turns out this is faster than pre-sorting the strings:

def using_sorted(strings):
    stsets = sorted(
        (frozenset(s.split()) for s in strings), key=len, reverse=True)
    superstrings = set()
    for stset in stsets:
        if not any(stset.issubset(s) for s in superstrings):
            superstrings.add(stset)
    return superstrings

In [29]: timeit find_supersets(strings)
100000 loops, best of 3: 18.3 us per loop
In [25]: timeit using_sorted(strings)
10000 loops, best of 3: 24.9 us per loop
like image 138
unutbu Avatar answered Nov 14 '22 22:11

unutbu