How can I filter through a list which containing strings and substrings to return only the longest strings. (If any item in the list is a substring of another, return only the longer string.)
I have this function. Is there a faster way?
def filterSublist(lst):
uniq = lst
for elem in lst:
uniq = [x for x in uniq if (x == elem) or (x not in elem)]
return uniq
lst = ["a", "abc", "b", "d", "xy", "xyz"]
print filterSublist(lst)
> ['abc', 'd', 'xyz']
> Function time: 0.000011
Filter a list of string using filter() method. filter() method accepts two parameters. The first parameter takes a function name or None and the second parameter takes the name of the list variable as values. filter() method stores those data from the list if it returns true, otherwise, it discards the data.
Method #2 : Using any() The any function can be used to compute the presence of the test substring in all the strings of the list and return True if it's found in any. This is better than the above function as it doesn't explicitly take space to create new concatenated string.
In Python, the . replace() method and the re. sub() function are often used to clean up text by removing strings or substrings or replacing them.
A simple quadratic time solution would be this:
res = []
n = len(lst)
for i in xrange(n):
if not any(i != j and lst[i] in lst[j] for j in xrange(n)):
res.append(lst[i])
But we can do much better:
Let $
be a character that does not appear in any of your strings and has a lower value than all your actual characters.
Let S
be the concatenation of all your strings, with $
in between. In your example, S = a$abc$b$d$xy$xyz
.
You can build the suffix array of S
in linear time. You can also use a much simpler O(n log^2 n) construction algorithm that I described in another answer.
Now for every string in lst
, check if it occurs in the suffix array exactly once. You can do two binary searches to find the locations of the substring, they form a contiguous range in the suffix array. If the string occurs more than once, you remove it.
With LCP information precomputed, this can be done in linear time as well.
Example O(n log^2 n) implementation, adapted from my suffix array answer:
def findFirst(lo, hi, pred):
""" Find the first i in range(lo, hi) with pred(i) == True.
Requires pred to be a monotone. If there is no such i, return hi. """
while lo < hi:
mid = (lo + hi) // 2
if pred(mid): hi = mid;
else: lo = mid + 1
return lo
# uses the algorithm described in https://stackoverflow.com/a/21342145/916657
class SuffixArray(object):
def __init__(self, s):
""" build the suffix array of s in O(n log^2 n) where n = len(s). """
n = len(s)
log2 = 0
while (1<<log2) < n:
log2 += 1
rank = [[0]*n for _ in xrange(log2)]
for i in xrange(n):
rank[0][i] = s[i]
L = [0]*n
for step in xrange(1, log2):
length = 1 << step
for i in xrange(n):
L[i] = (rank[step - 1][i],
rank[step - 1][i + length // 2] if i + length // 2 < n else -1,
i)
L.sort()
for i in xrange(n):
rank[step][L[i][2]] = \
rank[step][L[i - 1][2]] if i > 0 and L[i][:2] == L[i-1][:2] else i
self.log2 = log2
self.rank = rank
self.sa = [l[2] for l in L]
self.s = s
self.rev = [0]*n
for i, j in enumerate(self.sa):
self.rev[j] = i
def lcp(self, x, y):
""" compute the longest common prefix of s[x:] and s[y:] in O(log n). """
n = len(self.s)
if x == y:
return n - x
ret = 0
for k in xrange(self.log2 - 1, -1, -1):
if x >= n or y >= n:
break
if self.rank[k][x] == self.rank[k][y]:
x += 1<<k
y += 1<<k
ret += 1<<k
return ret
def compareSubstrings(self, x, lx, y, ly):
""" compare substrings s[x:x+lx] and s[y:y+yl] in O(log n). """
l = min((self.lcp(x, y), lx, ly))
if l == lx == ly: return 0
if l == lx: return -1
if l == ly: return 1
return cmp(self.s[x + l], self.s[y + l])
def count(self, x, l):
""" count occurences of substring s[x:x+l] in O(log n). """
n = len(self.s)
cs = self.compareSubstrings
lo = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) >= 0)
hi = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) > 0)
return hi - lo
def debug(self):
""" print the suffix array for debugging purposes. """
for i, j in enumerate(self.sa):
print str(i).ljust(4), self.s[j:], self.lcp(self.sa[i], self.sa[i-1]) if i >0 else "n/a"
def filterSublist(lst):
splitter = "\x00"
s = splitter.join(lst) + splitter
sa = SuffixArray(s)
res = []
offset = 0
for x in lst:
if sa.count(offset, len(x)) == 1:
res.append(x)
offset += len(x) + 1
return res
However, the interpretation overhead likely causes this to be slower than the O(n^2) approaches unless S
is really large (in the order of 10^5 characters or more).
You can build your problem in matrix form as:
import numpy as np
lst = np.array(["a", "abc", "b", "d", "xy", "xyz"], object)
out = np.zeros((len(lst), len(lst)), dtype=int)
for i in range(len(lst)):
for j in range(len(lst)):
out[i,j] = lst[i] in lst[j]
from where you get out
as:
array([[1, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0],
[0, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 1]])
then, the answer will be the indices of lst
where the sum of òut
along the columns is 1
(the string is only in itself):
lst[out.sum(axis=1)==1]
#array(['abc', 'd', 'xyz'], dtype=object)
EDIT: You can do it much more efficiently with:
from numpy.lib.stride_tricks import as_strided
from string import find
size = len(lst)
a = np.char.array(lst)
a2 = np.char.array(as_strided(a, shape=(size, size),
strides=(a.strides[0], 0)))
out = find(a2, a)
out[out==0] = 1
out[out==-1] = 0
print a[out.sum(axis=0)==1]
# chararray(['abc', 'd', 'xyz'], dtype='|S3')
a[out.sum(axis=0)==1]
Does the order matter? If not,
a = ["a", "abc", "b", "d", "xy", "xyz"]
a.sort(key=len, reverse=True)
n = len(a)
for i in range(n - 1):
if a[i]:
for j in range(i + 1, n):
if a[j] in a[i]:
a[j] = ''
print filter(len, a) # ['abc', 'xyz', 'd']
Not very efficient, but simple.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With