I have a large list of digit strings like this one. The individual strings are relatively short (say less than 50 digits).
data = [
'300303334',
'53210234',
'123456789',
'5374576807063874'
]
I need to find out a efficient data structure (speed first, memory second) and algorithm which returns only those strings that are composed of a given set of digits.
Example results:
filter(data, [0,3,4]) = ['300303334']
filter(data, [0,1,2,3,4,5]) = ['300303334', '53210234']
The data list will usually fit into memory.
For each digit, precompute a postings list that don't contain the digit.
postings = [[] for _ in xrange(10)]
for i, d in enumerate(data):
for j in xrange(10):
digit = str(j)
if digit not in d:
postings[j].append(i)
Now, to find all strings that contain, for example, just the digits [1, 3, 5] you can merge the postings lists for the other digits (ie: 0, 2, 4, 6, 7, 8, 9).
def intersect_postings(p0, p1):
i0, i1 = next(p0), next(p1)
while True:
if i0 == i1:
yield i0
i0, i1 = next(p0), next(p1)
elif i0 < i1: i0 = next(p0)
else: i1 = next(p1)
def find_all(digits):
p = None
for d in xrange(10):
if d not in digits:
if p is None: p = iter(postings[d])
else: p = intersect_postings(p, iter(postings[d]))
return (data[i] for i in p) if p else iter(data)
print list(find_all([0, 3, 4]))
print list(find_all([0, 1, 2, 3, 4, 5]))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With