I have some trouble with filtering a list of strings. I found a similar question here but is not what i need.
The input list is:
l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']
and the expected result is
['ab', 'xc', 'sdfdg']
The order of the items in the result is not important
The filter function must be fast because the size of list is big
My current solution is
l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']
for i in range(0, len(l) - 1):
for j in range(i + 1, len(l)):
if l[j].startswith(l[i]):
l[j] = l[i]
else:
if l[i].startswith(l[j]):
l[i] = l[j]
print list(set(l))
EDIT
After multiple tests with a big input data, list with 1500000 strings, my best solution for this is:
def filter(l):
if l==[]:
return []
l2=[]
l2.append(l[0])
llen = len(l)
k=0
itter = 0
while k<llen:
addkelem = ''
j=0
l2len = len(l2)
while j<l2len:
if (l2[j].startswith(l[k]) and l[k]!= l2[j]):
l2[j]=l[k]
l.remove(l[k])
llen-=1
j-=1
addkelem = ''
continue
if (l[k].startswith(l2[j])):
addkelem = ''
break
elif(l[k] not in l2):
addkelem = l[k]
j+=1
if addkelem != '':
l2.append(addkelem)
addkelem = ''
k+=1
return l2
for which the execution time is around of 213 seconds
Sample imput data - each line is a string in list
The startswith() method returns True if a string starts with the specified prefix(string). If not, it returns False .
Method #1 : Using list comprehension + startswith() In this method, we use list comprehension for traversal logic and the startswith method to filter out all the strings that starts with a particular letter. The remaining strings can be used to make a different list.
There are multiple ways to remove whitespace and other characters from a string in Python. The most commonly known methods are strip() , lstrip() , and rstrip() . Since Python version 3.9, two highly anticipated methods were introduced to remove the prefix or suffix of a string: removeprefix() and removesuffix() .
This algorithm completes the task in 0.97 second on my computer, with the input file submitted by the author (154MB):
l.sort()
last_str = l[0]
filtered = [last_str]
app = filtered.append
for str in l:
if not str.startswith(last_str):
last_str = str
app(str)
# Commented because of the massive amount of data to print.
# print filtered
The algorithm is simple: first sort the list lexicographically, then search for the first string which isn't prefixed by the very first one of the list, then search the next one which isn't prefixed by the last unprefixed one, etc.
If the list is already sorted (your example file seems to be already sorted), you can remove the l.sort()
line, which will result in a O(n) complexity in both time and memory.
You can group the items by first letter and just search the sublists, no string can start with a substring unless it has at least the same first letter:
from collections import defaultdict
def find(l):
d = defaultdict(list)
# group by first letter
for ele in l:
d[ele[0]].append(ele)
for val in d.values():
for v in val:
# check each substring in the sublist
if not any(v.startswith(s) and v != s for s in val):
yield v
print(list(find(l)))
['sdfdg', 'xc', 'ab']
This correctly filters the words, as you can see from the code below that the reduce function does not, 'tool'
should not be in the output:
In [56]: l = ["tool",'ab',"too", 'xc', 'abb',"toot", 'abed',"abel", 'sdfdg', 'abfdsdg', 'xccc',"xcew","xrew"]
In [57]: reduce(r,l)
Out[57]: ['tool', 'ab', 'too', 'xc', 'sdfdg', 'xrew']
In [58]: list(find(l))
Out[58]: ['sdfdg', 'too', 'xc', 'xrew', 'ab']
It also does it efficiently:
In [59]: l = ["".join(sample(ascii_lowercase, randint(2,25))) for _ in range(5000)]
In [60]: timeit reduce(r,l)
1 loops, best of 3: 2.12 s per loop
In [61]: timeit list(find(l))
1 loops, best of 3: 203 ms per loop
In [66]: %%timeit
..... result = []
....: for element in lst:
....: is_prefixed = False
....: for possible_prefix in lst:
....: if element is not possible_prefix and element.startswith(possible_prefix):
....: is_prefixed = True
....: break
....: if not is_prefixed:
....: result.append(element)
....:
1 loops, best of 3: 4.39 s per loop
In [92]: timeit list(my_filter(l))
1 loops, best of 3: 2.94 s per loop
If you know the minimum string length is always > 1 you can optimise further, again if the minimum length string is 2
then a word has to have a minimum of the first two letters in common:
def find(l):
d = defaultdict(list)
# find shortest length string to use as key length
mn = len(min(l, key=len))
for ele in l:
d[ele[:mn]].append(ele)
for val in d.values():
for v in val:
if not any(v.startswith(s) and v != s for s in val):
yield v
In [84]: timeit list(find(l))
100 loops, best of 3: 14.6 ms per loop
Lastly if you have dupes you may want to filter out the duplicated words from your list but you need to keep them to compare:
from collections import defaultdict,Counter
def find(l):
d = defaultdict(list)
mn = len(min(l, key=len))
cn = Counter(l)
for ele in l:
d[ele[:mn]].append(ele)
for val in d.values():
for v in val:
if not any(v.startswith(s) and v != s for s in val):
# make sure v is not a dupe
if cn[v] == 1:
yield v
So if speed is important, an implementation using some variation of the code above is going to be significantly faster than your naive approach. There is also more data stored in memory so you should also take the into account.
To save memory we can create a counter for each val/sublist so we only store a single counter dict of words at a time:
def find(l):
d = defaultdict(list)
mn = len(min(l, key=len))
for ele in l:
d[ele[:mn]].append(ele)
for val in d.values():
# we only need check each grouping of words for dupes
cn = Counter(val)
for v in val:
if not any(v.startswith(s) and v != s for s in val):
if cn[v] == 1:
yield v
creating a dict each loop adds 5ms so still < 20ms for 5k words.
The reduce method should work if the data is sorted:
reduce(r,sorted(l)) # -> ['ab', 'sdfdg', 'too', 'xc', 'xrew']
To make the difference clear between the behaviour:
In [202]: l = ["tool",'ab',"too", 'xc', 'abb',"toot", 'abed',
"abel", 'sdfdg', 'abfdsdg', 'xccc',"xcew","xrew","ab"]
In [203]: list(filter_list(l))
Out[203]: ['ab', 'too', 'xc', 'sdfdg', 'xrew', 'ab']
In [204]: list(find(l))
Out[204]: ['sdfdg', 'too', 'xc', 'xrew', 'ab', 'ab']
In [205]: reduce(r,sorted(l))
Out[205]: ['ab', 'sdfdg', 'too', 'xc', 'xrew']
In [206]: list(find_dupe(l))
Out[206]: ['too', 'xrew', 'xc', 'sdfdg']
In [207]: list(my_filter(l))
Out[207]: ['sdfdg', 'xrew', 'too', 'xc']
In [208]: "ab".startswith("ab")
Out[208]: True
So ab
is repeated twice so using a set or a dict without keeping track of how may times ab
appeared would mean we consider that there was no other element that satisfied the condition ab
"ab".startswith(other ) == True
, which we can see is incorrect.
You can also use itertools.groupby to group based on the min index size:
def find_dupe(l):
l.sort()
mn = len(min(l, key=len))
for k, val in groupby(l, key=lambda x: x[:mn]):
val = list(val)
for v in val:
cn = Counter(val)
if not any(v.startswith(s) and v != s for s in val) and cn[v] == 1:
yield v
Based on your comments then we can adjust my first code if you don't think "dd".startswith("dd")
should be True with repeated elements:
l = ['abbb', 'xc', 'abb', 'abed', 'sdfdg', 'xc','abfdsdg', 'xccc', 'd','dd','sdfdg', 'xc','abfdsdg', 'xccc', 'd','dd']
def find_with_dupe(l):
d = defaultdict(list)
# group by first letter
srt = sorted(set(l))
ind = len(srt[0])
for ele in srt:
d[ele[:ind]].append(ele)
for val in d.values():
for v in val:
# check each substring in the sublist
if not any(v.startswith(s) and v != s for s in val):
yield v
print(list(find_with_dupe(l)))
['abfdsdg', 'abed', 'abb', 'd', 'sdfdg', 'xc']
Which run on a random sample of text runs in a fraction of the time your own code does:
In [15]: l = open("/home/padraic/Downloads/sample.txt").read().split()
In [16]: timeit list(find(l))
100 loops, best of 3: 19 ms per loop
In [17]: %%timeit
....: l = open("/home/padraic/Downloads/sample.txt").read().split()
....: for i in range(0, len(l) - 1):
....: for j in range(i + 1, len(l)):
....: if l[j].startswith(l[i]):
....: l[j] = l[i]
....: else:
....: if l[i].startswith(l[j]):
....: l[i] = l[j]
....:
1 loops, best of 3: 4.92 s per loop
Both returning identical output:
In [41]: l = open("/home/padraic/Downloads/sample.txt").read().split()
In [42]:
for i in range(0, len(l) - 1):
for j in range(i + 1, len(l)):
if l[j].startswith(l[i]):
l[j] = l[i]
else:
if l[i].startswith(l[j]):
l[i] = l[j]
....:
In [43]:
In [43]: l2 = open("/home/padraic/Downloads/sample.txt").read().split()
In [44]: sorted(set(l)) == sorted(find(l2))
Out[44]: True
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With