Filter only digit sequences containing a given set of digits

Question

I have a large list of digit strings like this one. The individual strings are relatively short (say less than 50 digits).

data = [
  '300303334',
  '53210234',
  '123456789',
  '5374576807063874'
]

I need to find out a efficient data structure (speed first, memory second) and algorithm which returns only those strings that are composed of a given set of digits.

Example results:

filter(data, [0,3,4]) = ['300303334']
filter(data, [0,1,2,3,4,5]) = ['300303334', '53210234']

The data list will usually fit into memory.

Paul Hankin · Accepted Answer

For each digit, precompute a postings list that don't contain the digit.

postings = [[] for _ in xrange(10)]
for i, d in enumerate(data):
    for j in xrange(10):
        digit = str(j)
        if digit not in d:
            postings[j].append(i)

Now, to find all strings that contain, for example, just the digits [1, 3, 5] you can merge the postings lists for the other digits (ie: 0, 2, 4, 6, 7, 8, 9).

def intersect_postings(p0, p1):
    i0, i1 = next(p0), next(p1)
    while True:
         if i0 == i1:
            yield i0
            i0, i1 = next(p0), next(p1)
         elif i0 < i1: i0 = next(p0)
         else: i1 = next(p1)

def find_all(digits):
    p = None
    for d in xrange(10):
        if d not in digits:
            if p is None: p = iter(postings[d])
            else: p = intersect_postings(p, iter(postings[d]))
    return (data[i] for i in p) if p else iter(data)

print list(find_all([0, 3, 4]))
print list(find_all([0, 1, 2, 3, 4, 5]))

Filter only digit sequences containing a given set of digits

Tags:

performance

algorithm

lookup

data-structures

filtering

user323094

1 Answers

Paul Hankin

Recent Activity

Donate For Us

Filter only digit sequences containing a given set of digits

Tags:

performance

algorithm

lookup

data-structures

filtering

user323094

1 Answers

Paul Hankin

Related questions

Recent Activity

Donate For Us