I've found variants of this idea, but none that could get me (I'm very new to Python) to where I need to be.
Here is the scenario:
- hashfile.txt, consisting of unique strings all on separate lines
- addresses.txt file
- outfile.txt
My current code has been optimized to the best of my ability, but it only gets through around 150 lines/second. Considering there are over 1.5 billion lines in hashfile.txt, any optimization would help.
fin = 'hashed.txt'
nonzeros = open('addrOnly.txt', 'r')
fout = open('hits.txt', 'w')

# Read the whole address list into one string; 'address in lines' below
# does a substring scan of this entire string for every line of the hash file.
lines = nonzeros.read()

i = 0
count = 0
with open(fin, 'r') as f:
    for privkey in f:
        # The address is the first comma-separated field on each line.
        address = privkey.split(", ")[0]
        if address in lines:
            fout.write(privkey)
        i = i + 1
        # Print progress every 100 lines processed.
        if i % 100 == 0:
            count = count + 100
            print "Passed: " + str(count)
What you're looking to implement is probably the Rabin-Karp string search. It's highly efficient when you're searching for multiple strings simultaneously in some corpus.
There is more information about Python implementations in this post: python efficient substring search.
Since you're searching for multiple addresses at once, you would probably want to hash the entries in addresses.txt and compare them all at once against the Rabin-Karp rolling hash at each position. Read up on the rolling hash in Rabin-Karp and you'll see how this works.
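As a rough sketch of the rolling hash idea (BASE and MOD below are arbitrary choices of mine, not something prescribed by the algorithm), sliding the window costs constant time per character instead of rehashing the whole window:

# Minimal sketch of the Rabin-Karp rolling hash update.
# BASE and MOD are illustrative: a byte-sized base and a large prime modulus.
BASE = 256
MOD = 1000000007

def window_hash(window):
    # Hash the first window of length m from scratch.
    h = 0
    for ch in window:
        h = (h * BASE + ord(ch)) % MOD
    return h

def roll(h, old_char, new_char, high_order):
    # Slide the window one character: drop old_char, append new_char.
    # high_order is (BASE ** (m - 1)) % MOD, precomputed once for window length m.
    h = (h - ord(old_char) * high_order) % MOD
    return (h * BASE + ord(new_char)) % MOD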
Rabin-Karp requires all patterns to be the same length, but in practice the addresses are probably all of some non-negligible length, so you can truncate them all down to the same (not too short) length and hash just that prefix. You might also want to modify the Rabin-Karp hash to be invariant to whitespace and small differences in how addresses are formatted, and define a custom string comparator along the same lines to confirm matches.
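Putting the pieces together, here is a minimal sketch under a few assumptions of my own: the file names follow the question's code, every address is truncated to an arbitrary PREFIX_LEN-character prefix, and BASE/MOD are the same illustrative constants as above. It is an illustration of the approach, not a drop-in implementation:

# Hypothetical sketch: match many truncated address prefixes against each
# line of the big file with a single rolling-hash pass per line.
PREFIX_LEN = 20                          # assumed common prefix length
BASE = 256
MOD = 1000000007
HIGH = pow(BASE, PREFIX_LEN - 1, MOD)    # precomputed for the rolling update

def rk_hash(s):
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

# Hash every target address prefix once, keyed by its Rabin-Karp hash.
targets = {}
with open('addrOnly.txt') as f:
    for line in f:
        addr = line.strip()
        if len(addr) >= PREFIX_LEN:
            prefix = addr[:PREFIX_LEN]
            targets.setdefault(rk_hash(prefix), set()).add(prefix)

with open('hashed.txt') as fin, open('hits.txt', 'w') as fout:
    for line in fin:
        text = line.rstrip('\n')
        if len(text) < PREFIX_LEN:
            continue
        h = rk_hash(text[:PREFIX_LEN])
        for i in range(len(text) - PREFIX_LEN + 1):
            # Only compare actual strings on a hash hit, since different
            # windows can collide on the same hash value.
            if h in targets and text[i:i + PREFIX_LEN] in targets[h]:
                fout.write(line)
                break
            if i + PREFIX_LEN < len(text):
                # Rolling update: drop text[i], append text[i + PREFIX_LEN].
                h = (h - ord(text[i]) * HIGH) % MOD
                h = (h * BASE + ord(text[i + PREFIX_LEN])) % MOD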