Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Obtain a list containing string elements excluding elements prefixed with any other element from initial list

I have some trouble with filtering a list of strings. I found a similar question here but is not what i need.

The input list is:

l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']

and the expected result is

['ab', 'xc', 'sdfdg']

The order of the items in the result is not important

The filter function must be fast because the size of list is big

My current solution is

l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']
for i in range(0, len(l) - 1):
    for j in range(i + 1, len(l)):
        if l[j].startswith(l[i]):
            l[j] = l[i]
        else:
            if l[i].startswith(l[j]):
                l[i] = l[j]

print list(set(l)) 

EDIT

After multiple tests with a big input data, list with 1500000 strings, my best solution for this is:

def filter(l):
    if l==[]:
        return []
    l2=[]
    l2.append(l[0])
    llen = len(l)
    k=0
    itter = 0
    while k<llen:
        addkelem = ''
        j=0
        l2len = len(l2)
        while j<l2len:
            if (l2[j].startswith(l[k]) and l[k]!= l2[j]):
                l2[j]=l[k]
                l.remove(l[k])
                llen-=1
                j-=1
                addkelem = ''
                continue
            if (l[k].startswith(l2[j])):
                addkelem = ''
                break
            elif(l[k] not in l2):
                addkelem = l[k]
            j+=1
        if addkelem != '':
            l2.append(addkelem)
            addkelem = ''
        k+=1
    return l2

for which the execution time is around of 213 seconds

Sample imput data - each line is a string in list

like image 598
vasilenicusor Avatar asked May 12 '15 09:05

vasilenicusor


People also ask

How do you check a string is a prefix of another string in Python?

The startswith() method returns True if a string starts with the specified prefix(string). If not, it returns False .

How do you filter a string starting with Python?

Method #1 : Using list comprehension + startswith() In this method, we use list comprehension for traversal logic and the startswith method to filter out all the strings that starts with a particular letter. The remaining strings can be used to make a different list.

How do you remove prefix and suffix in Python?

There are multiple ways to remove whitespace and other characters from a string in Python. The most commonly known methods are strip() , lstrip() , and rstrip() . Since Python version 3.9, two highly anticipated methods were introduced to remove the prefix or suffix of a string: removeprefix() and removesuffix() .


2 Answers

This algorithm completes the task in 0.97 second on my computer, with the input file submitted by the author (154MB):

l.sort()

last_str = l[0]
filtered = [last_str]
app      = filtered.append

for str in l:
    if not str.startswith(last_str):
        last_str = str
        app(str)

# Commented because of the massive amount of data to print.
# print filtered

The algorithm is simple: first sort the list lexicographically, then search for the first string which isn't prefixed by the very first one of the list, then search the next one which isn't prefixed by the last unprefixed one, etc.

If the list is already sorted (your example file seems to be already sorted), you can remove the l.sort() line, which will result in a O(n) complexity in both time and memory.

like image 162
Benoit Esnard Avatar answered Oct 13 '22 18:10

Benoit Esnard


You can group the items by first letter and just search the sublists, no string can start with a substring unless it has at least the same first letter:

from collections import defaultdict

def find(l):
    d = defaultdict(list)
    # group by first letter
    for ele in l:
        d[ele[0]].append(ele)
    for val in d.values():
        for v in val:
            # check each substring in the sublist
            if not any(v.startswith(s) and v != s  for s in val):
                yield v

print(list(find(l)))
['sdfdg', 'xc', 'ab']

This correctly filters the words, as you can see from the code below that the reduce function does not, 'tool' should not be in the output:

In [56]: l = ["tool",'ab',"too", 'xc', 'abb',"toot", 'abed',"abel", 'sdfdg', 'abfdsdg', 'xccc',"xcew","xrew"]

In [57]: reduce(r,l)
Out[57]: ['tool', 'ab', 'too', 'xc', 'sdfdg', 'xrew']

In [58]: list(find(l))
Out[58]: ['sdfdg', 'too', 'xc', 'xrew', 'ab']

It also does it efficiently:

In [59]: l = ["".join(sample(ascii_lowercase, randint(2,25))) for _ in range(5000)]

In [60]: timeit reduce(r,l)
1 loops, best of 3: 2.12 s per loop

In [61]: timeit list(find(l))
1 loops, best of 3: 203 ms per loop

In [66]: %%timeit
..... result = []
....: for element in lst:
....:   is_prefixed = False
....:   for possible_prefix in lst:
....:     if element is not possible_prefix and  element.startswith(possible_prefix):
....:       is_prefixed = True
....:       break
....:   if not is_prefixed:
....:     result.append(element)
....: 
1 loops, best of 3: 4.39 s per loop

In [92]: timeit list(my_filter(l))
1 loops, best of 3: 2.94 s per loop

If you know the minimum string length is always > 1 you can optimise further, again if the minimum length string is 2 then a word has to have a minimum of the first two letters in common:

def find(l):
    d = defaultdict(list)
    # find shortest length string to use as key length
    mn = len(min(l, key=len))
    for ele in l:
        d[ele[:mn]].append(ele)

    for val in d.values():
        for v in val:
            if not any(v.startswith(s) and v != s for s in val):
                yield v


In [84]: timeit list(find(l))
100 loops, best of 3: 14.6 ms per loop

Lastly if you have dupes you may want to filter out the duplicated words from your list but you need to keep them to compare:

from collections import defaultdict,Counter

def find(l):
    d = defaultdict(list)
    mn = len(min(l, key=len))
    cn = Counter(l)
    for ele in l:
        d[ele[:mn]].append(ele)
    for val in d.values():
        for v in val:
            if not any(v.startswith(s) and v != s for s in val): 
                # make sure v is not a dupe
                if cn[v] == 1:
                    yield v

So if speed is important, an implementation using some variation of the code above is going to be significantly faster than your naive approach. There is also more data stored in memory so you should also take the into account.

To save memory we can create a counter for each val/sublist so we only store a single counter dict of words at a time:

def find(l):
    d = defaultdict(list)
    mn = len(min(l, key=len))
    for ele in l:
        d[ele[:mn]].append(ele)
    for val in d.values():
        # we only need check each grouping of words for dupes
        cn = Counter(val)
        for v in val:
            if not any(v.startswith(s) and v != s for s in val):
                if cn[v] == 1:
                    yield v

creating a dict each loop adds 5ms so still < 20ms for 5k words.

The reduce method should work if the data is sorted:

 reduce(r,sorted(l)) # -> ['ab', 'sdfdg', 'too', 'xc', 'xrew']

To make the difference clear between the behaviour:

In [202]: l = ["tool",'ab',"too", 'xc', 'abb',"toot", 'abed',
             "abel", 'sdfdg', 'abfdsdg', 'xccc',"xcew","xrew","ab"]

In [203]: list(filter_list(l))
Out[203]: ['ab', 'too', 'xc', 'sdfdg', 'xrew', 'ab']

In [204]: list(find(l))
Out[204]: ['sdfdg', 'too', 'xc', 'xrew', 'ab', 'ab']

In [205]: reduce(r,sorted(l))
Out[205]: ['ab', 'sdfdg', 'too', 'xc', 'xrew']

In [206]: list(find_dupe(l))
Out[206]: ['too', 'xrew', 'xc', 'sdfdg']

In [207]: list(my_filter(l))
Out[207]: ['sdfdg', 'xrew', 'too', 'xc']
In [208]: "ab".startswith("ab")
Out[208]: True

So ab is repeated twice so using a set or a dict without keeping track of how may times ab appeared would mean we consider that there was no other element that satisfied the condition ab "ab".startswith(other ) == True, which we can see is incorrect.

You can also use itertools.groupby to group based on the min index size:

def find_dupe(l):
    l.sort()
    mn = len(min(l, key=len))
    for k, val in groupby(l, key=lambda x: x[:mn]):
        val = list(val)
        for v in val:
            cn = Counter(val)
            if not any(v.startswith(s) and v != s for s in val) and cn[v] == 1:
                yield v

Based on your comments then we can adjust my first code if you don't think "dd".startswith("dd") should be True with repeated elements:

l = ['abbb', 'xc', 'abb', 'abed', 'sdfdg', 'xc','abfdsdg', 'xccc', 'd','dd','sdfdg', 'xc','abfdsdg', 'xccc', 'd','dd']


def find_with_dupe(l):
    d = defaultdict(list)
    # group by first letter
    srt = sorted(set(l))
    ind = len(srt[0])
    for ele in srt:
        d[ele[:ind]].append(ele)
    for val in d.values():
        for v in val:
            # check each substring in the sublist
            if not any(v.startswith(s) and v != s for s in val):
                yield v


print(list(find_with_dupe(l)))

['abfdsdg', 'abed', 'abb', 'd', 'sdfdg', 'xc']

Which run on a random sample of text runs in a fraction of the time your own code does:

In [15]: l = open("/home/padraic/Downloads/sample.txt").read().split()

In [16]: timeit list(find(l))                                         
100 loops, best of 3: 19 ms per loop

In [17]: %%timeit
   ....: l = open("/home/padraic/Downloads/sample.txt").read().split()
   ....: for i in range(0, len(l) - 1):
   ....:     for j in range(i + 1, len(l)):
   ....:         if l[j].startswith(l[i]):
   ....:             l[j] = l[i]
   ....:         else:
   ....:             if l[i].startswith(l[j]):
   ....:                 l[i] = l[j]
   ....: 

1 loops, best of 3: 4.92 s per loop

Both returning identical output:

In [41]: l = open("/home/padraic/Downloads/sample.txt").read().split()

In [42]:  
for i in range(0, len(l) - 1):
    for j in range(i + 1, len(l)):
        if l[j].startswith(l[i]):
            l[j] = l[i]
        else:
            if l[i].startswith(l[j]):
                l[i] = l[j]
   ....:                 


In [43]: 

In [43]: l2 = open("/home/padraic/Downloads/sample.txt").read().split()

In [44]: sorted(set(l)) == sorted(find(l2))
Out[44]: True
like image 32
Padraic Cunningham Avatar answered Oct 13 '22 18:10

Padraic Cunningham