Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Searching for string in massive files efficiently

I found variants of this idea, but none that could get me (very new to python) to where I need to be.

Here is the scenario:

  1. I have one massive, 27 gig hashfile.txt consisting of unique strings all on separate lines.
  2. I need to parse this file line by line searching for a match in another, not-so-large (~800mb) addresses.txt file
  3. When a match is found, this needs to be written to outfile.txt

My current code has been optimized to the best of my ability, but only gets around 150 lines/second. Considering I have over 1.5 billion lines in my hashfile.txt, any optimization would help.

fin = 'hashed.txt'
nonzeros = open('addrOnly.txt', 'r')
fout = open('hits.txt', 'w')
lines = nonzeros.read()
i = 0
count = 0

with open(fin, 'r') as f:
    for privkey in f:
            address = privkey.split(", ")[0]
            if address in lines:
                    fout.write(privkey)
            i = i+1
            if i%100 == 0:
                    count = count + 100
                    print "Passed: " + str(count)
like image 856
dannyincolor Avatar asked Jan 14 '23 23:01

dannyincolor


1 Answers

What you're looking to implement is probably the Rabin-Karp string search. It's highly efficient when you're searching for multiple strings simultaneously in some corpus.

More information about python implementations in this post. python efficient substring search

Since you're searching for multiple address at once, you would probably want to hash the entries in addresses.txt and compare them against the Rabin-Karp hash all at once, for each iteration. Read up more about the rolling hash in Rabin-Karp and you'll see how this works.

Since Rabin-Karp requires all patterns to be the same length; in practice all the addresses are probably of some non-negligible length, you can truncate them all down to the same (not too short) length and use the prefix to hash. Also, you might want to modify the Rabin-Karp hash to be invariant to whitespace and small differences in how addresses are formatted, and also define a custom string comparator similarly that will confirm matches.

like image 67
Andrew Mao Avatar answered Jan 17 '23 15:01

Andrew Mao