Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter list of strings, ignoring substrings of other items

How can I filter through a list which containing strings and substrings to return only the longest strings. (If any item in the list is a substring of another, return only the longer string.)

I have this function. Is there a faster way?

def filterSublist(lst):
    uniq = lst
    for elem in lst:
        uniq = [x for x in uniq if (x == elem) or (x not in elem)]
    return uniq

lst = ["a", "abc", "b", "d", "xy", "xyz"]
print filterSublist(lst)

> ['abc', 'd', 'xyz']
> Function time: 0.000011
like image 729
Katerina Avatar asked May 26 '14 10:05

Katerina


People also ask

How do you filter a list of strings?

Filter a list of string using filter() method. filter() method accepts two parameters. The first parameter takes a function name or None and the second parameter takes the name of the list variable as values. filter() method stores those data from the list if it returns true, otherwise, it discards the data.

How do you check if a substring is in a list of strings Python?

Method #2 : Using any() The any function can be used to compute the presence of the test substring in all the strings of the list and return True if it's found in any. This is better than the above function as it doesn't explicitly take space to create new concatenated string.

How do I remove a substring from a string in Python?

In Python, the . replace() method and the re. sub() function are often used to clean up text by removing strings or substrings or replacing them.


3 Answers

A simple quadratic time solution would be this:

res = []
n = len(lst)
for i in xrange(n):
    if not any(i != j and lst[i] in lst[j] for j in xrange(n)):
        res.append(lst[i])

But we can do much better:

Let $ be a character that does not appear in any of your strings and has a lower value than all your actual characters.

Let S be the concatenation of all your strings, with $ in between. In your example, S = a$abc$b$d$xy$xyz.

You can build the suffix array of S in linear time. You can also use a much simpler O(n log^2 n) construction algorithm that I described in another answer.

Now for every string in lst, check if it occurs in the suffix array exactly once. You can do two binary searches to find the locations of the substring, they form a contiguous range in the suffix array. If the string occurs more than once, you remove it.

With LCP information precomputed, this can be done in linear time as well.

Example O(n log^2 n) implementation, adapted from my suffix array answer:

def findFirst(lo, hi, pred):
  """ Find the first i in range(lo, hi) with pred(i) == True.
  Requires pred to be a monotone. If there is no such i, return hi. """
  while lo < hi:
    mid = (lo + hi) // 2
    if pred(mid): hi = mid;
    else: lo = mid + 1
  return lo

# uses the algorithm described in https://stackoverflow.com/a/21342145/916657
class SuffixArray(object):
  def __init__(self, s):
    """ build the suffix array of s in O(n log^2 n) where n = len(s). """
    n = len(s)
    log2 = 0
    while (1<<log2) < n:
      log2 += 1
    rank = [[0]*n for _ in xrange(log2)]
    for i in xrange(n):
      rank[0][i] = s[i]
    L = [0]*n
    for step in xrange(1, log2):
      length = 1 << step
      for i in xrange(n):
        L[i] = (rank[step - 1][i],
                rank[step - 1][i + length // 2] if i + length // 2 < n else -1,
                i)
      L.sort()
      for i in xrange(n):
        rank[step][L[i][2]] = \
          rank[step][L[i - 1][2]] if i > 0 and L[i][:2] == L[i-1][:2] else i
    self.log2 = log2
    self.rank = rank
    self.sa = [l[2] for l in L]
    self.s = s
    self.rev = [0]*n
    for i, j in enumerate(self.sa):
      self.rev[j] = i

  def lcp(self, x, y):
    """ compute the longest common prefix of s[x:] and s[y:] in O(log n). """
    n = len(self.s)
    if x == y:
      return n - x
    ret = 0
    for k in xrange(self.log2 - 1, -1, -1):
      if x >= n or y >= n:
        break
      if self.rank[k][x] == self.rank[k][y]:
        x += 1<<k
        y += 1<<k
        ret += 1<<k
    return ret

  def compareSubstrings(self, x, lx, y, ly):
    """ compare substrings s[x:x+lx] and s[y:y+yl] in O(log n). """
    l = min((self.lcp(x, y), lx, ly))
    if l == lx == ly: return 0
    if l == lx: return -1
    if l == ly: return 1
    return cmp(self.s[x + l], self.s[y + l])

  def count(self, x, l):
    """ count occurences of substring s[x:x+l] in O(log n). """
    n = len(self.s)
    cs = self.compareSubstrings
    lo = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) >= 0)
    hi = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) > 0)
    return hi - lo

  def debug(self):
    """ print the suffix array for debugging purposes. """
    for i, j in enumerate(self.sa):
      print str(i).ljust(4), self.s[j:], self.lcp(self.sa[i], self.sa[i-1]) if i >0 else "n/a"

def filterSublist(lst):
  splitter = "\x00"
  s = splitter.join(lst) + splitter
  sa = SuffixArray(s)
  res = []
  offset = 0
  for x in lst:
    if sa.count(offset, len(x)) == 1:
      res.append(x)
    offset += len(x) + 1
  return res

However, the interpretation overhead likely causes this to be slower than the O(n^2) approaches unless S is really large (in the order of 10^5 characters or more).

like image 64
Niklas B. Avatar answered Oct 19 '22 04:10

Niklas B.


You can build your problem in matrix form as:

import numpy as np

lst = np.array(["a", "abc", "b", "d", "xy", "xyz"], object)
out = np.zeros((len(lst), len(lst)), dtype=int)
for i in range(len(lst)):
    for j in range(len(lst)):
        out[i,j] = lst[i] in lst[j]

from where you get out as:

array([[1, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 1]])

then, the answer will be the indices of lst where the sum of òut along the columns is 1 (the string is only in itself):

lst[out.sum(axis=1)==1]
#array(['abc', 'd', 'xyz'], dtype=object)

EDIT: You can do it much more efficiently with:

from numpy.lib.stride_tricks import as_strided
from string import find

size = len(lst)
a = np.char.array(lst)
a2 = np.char.array(as_strided(a, shape=(size, size),
                                 strides=(a.strides[0], 0)))

out = find(a2, a)
out[out==0] = 1
out[out==-1] = 0
print a[out.sum(axis=0)==1]
# chararray(['abc', 'd', 'xyz'], dtype='|S3')

a[out.sum(axis=0)==1]
like image 21
Saullo G. P. Castro Avatar answered Oct 19 '22 03:10

Saullo G. P. Castro


Does the order matter? If not,

a = ["a", "abc", "b", "d", "xy", "xyz"]

a.sort(key=len, reverse=True)
n = len(a)

for i in range(n - 1):
    if a[i]:
        for j in range(i + 1, n):
            if a[j] in a[i]:
                a[j] = ''


print filter(len, a)  # ['abc', 'xyz', 'd']

Not very efficient, but simple.

like image 45
gog Avatar answered Oct 19 '22 03:10

gog