
Fastest way to search a 1GB+ string of data for the first occurrence of a pattern in Python

There's a 1 Gigabyte string of arbitrary data which you can assume to be equivalent to something like:

import os

gigabyte = 2**30
one_gb_string = os.urandom(1 * gigabyte)

We will be searching this string, one_gb_string, for an open-ended stream of fixed-width, 1-kilobyte patterns, one_kb_pattern. The pattern is different on every search, so there are no apparent caching opportunities. The same 1-gigabyte string will be searched over and over. Here is a simple generator to describe what's happening:

def findit(one_gb_string):
    while True:
        one_kb_pattern = get_next_pattern()
        yield one_gb_string.find(one_kb_pattern)

Note that only the first occurrence of the pattern needs to be found. After that, no other major processing should be done.

What can I use that's faster than Python's built-in find for matching 1 KB patterns against 1 GB or larger data strings?

(I am already aware of how to split up the string and search it in parallel, so you can disregard that basic optimization.)

Update: Please bound memory requirements to 16GB.

asked Nov 17 '09 by user213060


3 Answers

Since you clarify that long-ish preprocessing is acceptable, I'd suggest a variant of Rabin-Karp: "an algorithm of choice for multiple pattern search", as Wikipedia puts it.

Define a "rolling hash" function, i.e., one such that, when you know the hash for haystack[x:x+N], computing the hash for haystack[x+1:x+N+1] is O(1). (Normal hashing functions such as Python's built-in hash do not have this property, which is why you have to write your own, otherwise the preprocessing becomes exhaustingly long rather than merely long-ish;-). A polynomial approach is fruitful, and you could use, say, 30-bit hash results (by masking if needed, i.e., you can do the computation w/more precision and just store the masked 30 bits of choice). Let's call this rolling hash function RH for clarity.

So, compute 1G of RH results as you roll along the 1 GB haystack string; storing these gives you an array H of 1G 30-bit values (4 GB) mapping index-in-haystack -> RH value. You also want the reverse mapping, so build two more arrays: B, an array of 1G haystack indices, ordered so that all indices whose windows share the same RH value ("collisions" in hashing terms) are adjacent; and A, an array of 2**30 entries (1G entries), where A[k] holds the index into B of the first possibly-interesting haystack index for RH value k. H, A and B each have 1G entries of 4 bytes, so 12 GB total.
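
Here is a sketch of that preprocessing step, assuming the rh_init/rh_roll helpers and N from the sketch above (preprocess and NUM_BUCKETS are illustrative names; array('I') keeps each entry at 4 bytes, matching the sizes quoted above):

from array import array

NUM_BUCKETS = 1 << 30    # one bucket per possible 30-bit RH value

def preprocess(haystack):
    # H[pos] = RH of haystack[pos:pos+N]
    # A[k]   = index into B of the first haystack position whose RH is k
    # B      = all window positions, grouped so equal-RH positions are adjacent
    n_windows = len(haystack) - N + 1
    H = array('I', [0]) * n_windows
    A = array('I', [0]) * NUM_BUCKETS

    # Pass 1: hash every window, counting occurrences of each RH value in A.
    h = rh_init(haystack[:N])
    for pos in range(n_windows):
        H[pos] = h
        A[h] += 1
        if pos + N < len(haystack):
            h = rh_roll(h, haystack[pos], haystack[pos + N])

    # Turn the counts into exclusive prefix sums: A[k] = start of bucket k in B.
    total = 0
    for k in range(NUM_BUCKETS):
        count = A[k]
        A[k] = total
        total += count

    # Pass 2: scatter positions into B, using A as a running write cursor.
    B = array('I', [0]) * n_windows
    for pos in range(n_windows):
        B[A[H[pos]]] = pos
        A[H[pos]] += 1

    # A[k] now points at the end of bucket k; shift right to restore the starts.
    for k in range(NUM_BUCKETS - 1, 0, -1):
        A[k] = A[k - 1]
    A[0] = 0

    return H, A, B

Reusing A for both the counts and the write cursors keeps the peak footprint at roughly the 12 GB quoted above plus the 1 GB haystack, inside the question's 16 GB bound; in practice you would push these loops into numpy or C, since pure-Python loops over 2**30 entries are slow.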

Now for each incoming 1K needle, compute its RH, call it k, and use it as an index into A; A[k] gives you the first index ib into B at which it's worth comparing. So, do:

ib = A[k]
while ib < len(B):
    b = B[ib]                          # candidate position in the haystack
    if H[b] != k:                      # walked past the run of positions sharing hash k
        return "not found"
    if needle == haystack[b:b+1024]:   # verify the candidate byte-for-byte
        return "found at", b
    ib += 1
return "not found"

With a good RH you should have very few collisions, so the while loop should execute only a handful of times before returning one way or the other. Each needle search should therefore be very fast.

answered by Alex Martelli


There are a number of string matching algorithms used in the field of genetics to find substrings. You might try this paper or this paper.

answered by stimms


As far as I know, the standard find algorithm is a naive algorithm with complexity of about n*m comparisons, because it checks the pattern against every possible offset. There are more efficient algorithms, requiring about n+m comparisons. If your string is not a natural language string, you can try the Knuth–Morris–Pratt algorithm. The Boyer–Moore search algorithm is fast and fairly simple too.
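
For reference, a minimal sketch of Knuth–Morris–Pratt over bytes (kmp_find is an illustrative name, not a library function); note that it addresses a single n+m search, not the repeated-search setup in the question:

def kmp_find(haystack, needle):
    # Return the index of the first occurrence of needle in haystack, or -1.
    if not needle:
        return 0
    # fail[i] = length of the longest proper prefix of needle[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k
    # Scan the haystack without ever re-examining matched bytes: O(n + m).
    k = 0
    for i, c in enumerate(haystack):
        while k and c != needle[k]:
            k = fail[k - 1]
        if c == needle[k]:
            k += 1
        if k == len(needle):
            return i - len(needle) + 1
    return -1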

answered by Darth Beleg