How can I filter through a list which containing strings and substrings to return only the longest strings. (If any item in the list is a substring of another, return only the longer string.) I have this function. Is there a faster way? <pre class="prettyprint"><code>def filterSublist(lst): uniq = lst for elem in lst: uniq = [x for x in uniq if (x == elem) or (x not in elem)] return uniq lst = ["a", "abc", "b", "d", "xy", "xyz"] print filterSublist(lst) > ['abc', 'd', 'xyz'] > Function time: 0.000011 </code></pre>

A simple quadratic time solution would be this: <pre class="prettyprint"><code>res = [] n = len(lst) for i in xrange(n): if not any(i != j and lst[i] in lst[j] for j in xrange(n)): res.append(lst[i]) </code></pre> But we can do much better: Let <code>$</code> be a character that does not appear in any of your strings and has a lower value than all your actual characters. Let <code>S</code> be the concatenation of all your strings, with <code>$</code> in between. In your example, <code>S = a$abc$b$d$xy$xyz</code>. You can build the suffix array of <code>S</code> in linear time. You can also use a much simpler O(n log^2 n) construction algorithm that I described in another answer. Now for every string in <code>lst</code>, check if it occurs in the suffix array exactly once. You can do two binary searches to find the locations of the substring, they form a contiguous range in the suffix array. If the string occurs more than once, you remove it. With LCP information precomputed, this can be done in linear time as well. Example O(n log^2 n) implementation, adapted from my suffix array answer: <pre class="prettyprint"><code>def findFirst(lo, hi, pred): """ Find the first i in range(lo, hi) with pred(i) == True. Requires pred to be a monotone. If there is no such i, return hi. """ while lo < hi: mid = (lo + hi) // 2 if pred(mid): hi = mid; else: lo = mid + 1 return lo # uses the algorithm described in https://stackoverflow.com/a/21342145/916657 class SuffixArray(object): def __init__(self, s): """ build the suffix array of s in O(n log^2 n) where n = len(s). """ n = len(s) log2 = 0 while (1<<log2) < n: log2 += 1 rank = [[0]*n for _ in xrange(log2)] for i in xrange(n): rank[0][i] = s[i] L = [0]*n for step in xrange(1, log2): length = 1 << step for i in xrange(n): L[i] = (rank[step - 1][i], rank[step - 1][i + length // 2] if i + length // 2 < n else -1, i) L.sort() for i in xrange(n): rank[step][L[i][2]] = \ rank[step][L[i - 1][2]] if i > 0 and L[i][:2] == L[i-1][:2] else i self.log2 = log2 self.rank = rank self.sa = [l[2] for l in L] self.s = s self.rev = [0]*n for i, j in enumerate(self.sa): self.rev[j] = i def lcp(self, x, y): """ compute the longest common prefix of s[x:] and s[y:] in O(log n). """ n = len(self.s) if x == y: return n - x ret = 0 for k in xrange(self.log2 - 1, -1, -1): if x >= n or y >= n: break if self.rank[k][x] == self.rank[k][y]: x += 1<<k y += 1<<k ret += 1<<k return ret def compareSubstrings(self, x, lx, y, ly): """ compare substrings s[x:x+lx] and s[y:y+yl] in O(log n). """ l = min((self.lcp(x, y), lx, ly)) if l == lx == ly: return 0 if l == lx: return -1 if l == ly: return 1 return cmp(self.s[x + l], self.s[y + l]) def count(self, x, l): """ count occurences of substring s[x:x+l] in O(log n). """ n = len(self.s) cs = self.compareSubstrings lo = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) >= 0) hi = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) > 0) return hi - lo def debug(self): """ print the suffix array for debugging purposes. """ for i, j in enumerate(self.sa): print str(i).ljust(4), self.s[j:], self.lcp(self.sa[i], self.sa[i-1]) if i >0 else "n/a" def filterSublist(lst): splitter = "\x00" s = splitter.join(lst) + splitter sa = SuffixArray(s) res = [] offset = 0 for x in lst: if sa.count(offset, len(x)) == 1: res.append(x) offset += len(x) + 1 return res </code></pre> However, the interpretation overhead likely causes this to be slower than the O(n^2) approaches unless <code>S</code> is really large (in the order of 10^5 characters or more).

You can build your problem in matrix form as: <pre class="prettyprint"><code>import numpy as np lst = np.array(["a", "abc", "b", "d", "xy", "xyz"], object) out = np.zeros((len(lst), len(lst)), dtype=int) for i in range(len(lst)): for j in range(len(lst)): out[i,j] = lst[i] in lst[j] </code></pre> from where you get <code>out</code> as: <pre class="prettyprint"><code>array([[1, 1, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 1]]) </code></pre> then, the answer will be the indices of <code>lst</code> where the sum of <code>òut</code> along the columns is <code>1</code> (the string is only in itself): <pre class="prettyprint"><code>lst[out.sum(axis=1)==1] #array(['abc', 'd', 'xyz'], dtype=object) </code></pre> <hr> EDIT: You can do it much more efficiently with: <pre class="prettyprint"><code>from numpy.lib.stride_tricks import as_strided from string import find size = len(lst) a = np.char.array(lst) a2 = np.char.array(as_strided(a, shape=(size, size), strides=(a.strides[0], 0))) out = find(a2, a) out[out==0] = 1 out[out==-1] = 0 print a[out.sum(axis=0)==1] # chararray(['abc', 'd', 'xyz'], dtype='|S3') a[out.sum(axis=0)==1] </code></pre>

Filter list of strings, ignoring substrings of other items

How can I filter through a list which containing strings and substrings to return only the longest strings. (If any item in the list is a substring of another, return only the longer string.)

I have this function. Is there a faster way?

def filterSublist(lst):
    uniq = lst
    for elem in lst:
        uniq = [x for x in uniq if (x == elem) or (x not in elem)]
    return uniq

lst = ["a", "abc", "b", "d", "xy", "xyz"]
print filterSublist(lst)

> ['abc', 'd', 'xyz']
> Function time: 0.000011

How do you filter a list of strings?

Filter a list of string using filter() method. filter() method accepts two parameters. The first parameter takes a function name or None and the second parameter takes the name of the list variable as values. filter() method stores those data from the list if it returns true, otherwise, it discards the data.

How do you check if a substring is in a list of strings Python?

Method #2 : Using any() The any function can be used to compute the presence of the test substring in all the strings of the list and return True if it's found in any. This is better than the above function as it doesn't explicitly take space to create new concatenated string.

How do I remove a substring from a string in Python?

In Python, the . replace() method and the re. sub() function are often used to clean up text by removing strings or substrings or replacing them.

A simple quadratic time solution would be this:

res = []
n = len(lst)
for i in xrange(n):
    if not any(i != j and lst[i] in lst[j] for j in xrange(n)):
        res.append(lst[i])

But we can do much better:

Let $ be a character that does not appear in any of your strings and has a lower value than all your actual characters.

Let S be the concatenation of all your strings, with $ in between. In your example, S = a$abc$b$d$xy$xyz.

You can build the suffix array of S in linear time. You can also use a much simpler O(n log^2 n) construction algorithm that I described in another answer.

Now for every string in lst, check if it occurs in the suffix array exactly once. You can do two binary searches to find the locations of the substring, they form a contiguous range in the suffix array. If the string occurs more than once, you remove it.

With LCP information precomputed, this can be done in linear time as well.

Example O(n log^2 n) implementation, adapted from my suffix array answer:

def findFirst(lo, hi, pred):
  """ Find the first i in range(lo, hi) with pred(i) == True.
  Requires pred to be a monotone. If there is no such i, return hi. """
  while lo < hi:
    mid = (lo + hi) // 2
    if pred(mid): hi = mid;
    else: lo = mid + 1
  return lo

# uses the algorithm described in https://stackoverflow.com/a/21342145/916657
class SuffixArray(object):
  def __init__(self, s):
    """ build the suffix array of s in O(n log^2 n) where n = len(s). """
    n = len(s)
    log2 = 0
    while (1<<log2) < n:
      log2 += 1
    rank = [[0]*n for _ in xrange(log2)]
    for i in xrange(n):
      rank[0][i] = s[i]
    L = [0]*n
    for step in xrange(1, log2):
      length = 1 << step
      for i in xrange(n):
        L[i] = (rank[step - 1][i],
                rank[step - 1][i + length // 2] if i + length // 2 < n else -1,
                i)
      L.sort()
      for i in xrange(n):
        rank[step][L[i][2]] = \
          rank[step][L[i - 1][2]] if i > 0 and L[i][:2] == L[i-1][:2] else i
    self.log2 = log2
    self.rank = rank
    self.sa = [l[2] for l in L]
    self.s = s
    self.rev = [0]*n
    for i, j in enumerate(self.sa):
      self.rev[j] = i

  def lcp(self, x, y):
    """ compute the longest common prefix of s[x:] and s[y:] in O(log n). """
    n = len(self.s)
    if x == y:
      return n - x
    ret = 0
    for k in xrange(self.log2 - 1, -1, -1):
      if x >= n or y >= n:
        break
      if self.rank[k][x] == self.rank[k][y]:
        x += 1<<k
        y += 1<<k
        ret += 1<<k
    return ret

  def compareSubstrings(self, x, lx, y, ly):
    """ compare substrings s[x:x+lx] and s[y:y+yl] in O(log n). """
    l = min((self.lcp(x, y), lx, ly))
    if l == lx == ly: return 0
    if l == lx: return -1
    if l == ly: return 1
    return cmp(self.s[x + l], self.s[y + l])

  def count(self, x, l):
    """ count occurences of substring s[x:x+l] in O(log n). """
    n = len(self.s)
    cs = self.compareSubstrings
    lo = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) >= 0)
    hi = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) > 0)
    return hi - lo

  def debug(self):
    """ print the suffix array for debugging purposes. """
    for i, j in enumerate(self.sa):
      print str(i).ljust(4), self.s[j:], self.lcp(self.sa[i], self.sa[i-1]) if i >0 else "n/a"

def filterSublist(lst):
  splitter = "\x00"
  s = splitter.join(lst) + splitter
  sa = SuffixArray(s)
  res = []
  offset = 0
  for x in lst:
    if sa.count(offset, len(x)) == 1:
      res.append(x)
    offset += len(x) + 1
  return res

However, the interpretation overhead likely causes this to be slower than the O(n^2) approaches unless S is really large (in the order of 10^5 characters or more).

You can build your problem in matrix form as:

import numpy as np

lst = np.array(["a", "abc", "b", "d", "xy", "xyz"], object)
out = np.zeros((len(lst), len(lst)), dtype=int)
for i in range(len(lst)):
    for j in range(len(lst)):
        out[i,j] = lst[i] in lst[j]

from where you get out as:

array([[1, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 1]])

then, the answer will be the indices of lst where the sum of òut along the columns is 1 (the string is only in itself):

lst[out.sum(axis=1)==1]
#array(['abc', 'd', 'xyz'], dtype=object)

EDIT: You can do it much more efficiently with:

from numpy.lib.stride_tricks import as_strided
from string import find

size = len(lst)
a = np.char.array(lst)
a2 = np.char.array(as_strided(a, shape=(size, size),
                                 strides=(a.strides[0], 0)))

out = find(a2, a)
out[out==0] = 1
out[out==-1] = 0
print a[out.sum(axis=0)==1]
# chararray(['abc', 'd', 'xyz'], dtype='|S3')

a[out.sum(axis=0)==1]

Does the order matter? If not,

a = ["a", "abc", "b", "d", "xy", "xyz"]

a.sort(key=len, reverse=True)
n = len(a)

for i in range(n - 1):
    if a[i]:
        for j in range(i + 1, n):
            if a[j] in a[i]:
                a[j] = ''


print filter(len, a)  # ['abc', 'xyz', 'd']

Not very efficient, but simple.

Filter list of strings, ignoring substrings of other items

Tags:

python

string

algorithm

Katerina

People also ask

3 Answers

Niklas B.

Saullo G. P. Castro

gog

Recent Activity

Donate For Us

Filter list of strings, ignoring substrings of other items

Tags:

python

string

algorithm

Katerina

People also ask

3 Answers

Niklas B.

Saullo G. P. Castro

gog

Related questions

Recent Activity

Donate For Us