I recently read Jeff's blog post entitled Speed Hashing, where amongst other things he mentions that you can hash things really fast by harnessing the power of your GPU.
I was wondering whether it's possible to harness the power of the GPU to hash things in Python (MD5, SHA-1, etc.)?
I'm interested in this to see how fast I can brute-force hashes (nothing real-world, just hashes from old leaked data dumps).
At the moment, I'm doing this sort of thing (simplified example):
from itertools import product
from hashlib import md5

hashes = ["some", "hashes"]  # target hex digests to crack

chars = [chr(i) for i in range(97, 123)]  # a-z only

for i in range(1, 6):  # all combos of a-z, 1-5 chars
    for c in product(chars, repeat=i):
        s = ''.join(c)
        if md5(s.encode()).hexdigest() in hashes:
            print("Found", s)
But I was wondering whether there's a way to speed that up using the GPU? I'm guessing I'm going to need a module that can generate hashes like this in bulk; does anyone know of one?
There are two obstacles:
1. Writing a program (a GPU 'kernel') that performs the hashing, using a technology such as CUDA or OpenCL.
2. Calling the program running on the GPU from Python, and exchanging data. There are several Python+CUDA projects which handle that part (PyCUDA is one well-known example), and with suitable searching you might find more. Searching for "Python GPU programming" also turns up relevant material.
Then the Python program will load and invoke the GPU 'kernel' (the program created using a technology from part 1 of this answer) using one of the technologies from part 2, or equivalent.
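For a feel of what part 2 looks like in practice, here is a minimal sketch using PyCUDA (assuming PyCUDA and the CUDA toolkit are installed); the kernel is a trivial stand-in for the real hashing work, not an MD5 implementation:

import numpy as np
import pycuda.autoinit              # initialise a CUDA context on import
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Part 1: the GPU 'kernel', written in CUDA C and compiled at runtime.
mod = SourceModule("""
__global__ void double_them(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2;   // stand-in for 'do hashing work on candidate i'
}
""")
double_them = mod.get_function("double_them")

# Part 2: call it from Python; drv.InOut copies the array both ways.
data = np.arange(256, dtype=np.int32)
double_them(drv.InOut(data), block=(256, 1, 1), grid=(1, 1))
print(data[:8])  # [ 0  2  4  6  8 10 12 14]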
EDIT: You might generate the entire set of 'brute force' values, and their MD5 hashes, on the GPU, then just retrieve the results using Python. This might be easier than generating the values in Python, passing them to the GPU, and getting the MD5s back.
If I've understood correctly, the program generates all 1-, 2-, 3-, 4-, 5-, and 6-character lower-case-letter strings and computes an MD5 hash of each, yes?
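If so, one option is for each GPU thread to derive its own candidate string from its global thread index, so only the index range ever needs to cross the bus. Here is a sketch of that index-to-key mapping, written in plain Python for clarity (the same arithmetic would live inside the kernel):

import string

ALPHABET = string.ascii_lowercase  # the 26 letters a-z

def candidate(index):
    """Map a linear index to the index-th key (lengths 1..6, a-z)."""
    for length in range(1, 7):
        count = 26 ** length              # keys of exactly this length
        if index < count:
            chars = []
            for _ in range(length):       # base-26 digits, low digit first
                index, r = divmod(index, 26)
                chars.append(ALPHABET[r])
            return ''.join(chars)
        index -= count
    raise IndexError("index out of range")

# candidate(0) == 'a', candidate(25) == 'z', candidate(26) == 'aa', ...

The total key count is sum(26**i for i in range(1, 7)), about 3.2 * 10^8, close to the 27^6 approximation used below.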
Edit 2: My previous analysis was utterly wrong; I apologise.
Edit 3: Skimming Wikipedia's MD5 article, it looks like calculating the MD5 of a constant-length string (e.g. 6 ASCII characters) can be optimised.
According to Wikipedia's pseudo-code it is only 64 loop iterations, in groups of 16 that use the same arithmetic. So, if the key is under 55 bytes, the core of the calculation could be 'unrolled' from:
for i from 0 to 63
    if 0 ≤ i ≤ 15 then
        f := (b and c) or ((not b) and d)
        g := i
    else if 16 ≤ i ≤ 31
        f := (d and b) or ((not d) and c)
        g := (5×i + 1) mod 16
    else if 32 ≤ i ≤ 47
        f := b xor c xor d
        g := (3×i + 5) mod 16
    else if 48 ≤ i ≤ 63
        f := c xor (b or (not d))
        g := (7×i) mod 16
    temp := d
    d := c
    c := b
    b := b + leftrotate((a + f + k[i] + w[g]), r[i])
    a := temp
end for
to:
// i == 0
f := (b and c) or ((not b) and d) // +4 ops
// g := i
temp := d
d := c
c := b
b := b + leftrotate((a + f + k[0] + w[0]) , r[0]) // 9 ops
a := temp
// i == 1
f := (b and c) or ((not b) and d)
// g := i
temp := d
d := c
c := b
b := b + leftrotate((a + f + k[1] + w[1]) , r[1])
a := temp
// ... and so on, through i == 63
That unrolling makes some of the array indexing constant, which should allow a good GPU compiler to do even more constant propagation. This might give a significant improvement. Each step is approximately 9 operations, and the compiler will need to shuffle 5 pieces of data, so that's roughly 14 operations/step * 64 steps, or approximately 900 operations per hash.
Edit 4:
Glerk! I've read more of the Wikipedia MD5 algorithm; MD5 is easier to attack than I'd realised. Only two loop iterations of each group of 16 directly use the variable 6-byte key; the rest of the message block is constant. Everything else in the algorithm is shuffling and bit-wise operations, which are likely amenable to very significant further optimisation. If only 2 of each 16 iterations involve the key, that might make it up to 8x faster, and probably at least 4x.
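To make that concrete, here is a small sketch (following the little-endian padding rules in Wikipedia's description) of the sixteen 32-bit message words w[0..15] for a 6-byte key; only the first two depend on the key at all:

import struct

def md5_words_for_6_byte_key(key: bytes):
    """Build the single 64-byte MD5 block for a 6-byte message."""
    assert len(key) == 6
    # key, then the mandatory 0x80 byte, zero padding, then the bit length
    block = key + b'\x80' + b'\x00' * 49 + struct.pack('<Q', 6 * 8)
    assert len(block) == 64
    return list(struct.unpack('<16I', block))  # 16 little-endian words

w = md5_words_for_6_byte_key(b'secret')
# w[0] and w[1] hold the key (plus the 0x80 byte); w[2..13] are all zero,
# w[14] == 48 and w[15] == 0, i.e. constants a compiler can fold away.

Since each group of 16 iterations reads each of w[0..15] exactly once, only 2 iterations per group touch key-dependent data.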
So rather than a 1024-core GPU running at 1 GHz giving 1024 hashes/microsecond, say 4096/microsecond or 8192/microsecond, i.e. 4-8 hashes/nanosecond.
There are approximately 27^6 = 387,420,489 keys, and hence that many MD5 hashes.
387,420,489 keys at 4-8 hashes/nanosecond is approximately 0.05-0.1 seconds.
Communication between host and GPU will be quite slow, but unlikely to add more than 100% overhead.
So, approximately between 0.1 and 0.2 seconds overall.
An MD5 hash is 16 bytes, so storing all of them would consume about 6.2 GB. On two modern GPUs that would only require 2 transfers each, but it would be a very significant overhead. If the hash values are saved to disk (even SSD), or moved over 10 Gbit Ethernet, the hash generation is swamped by the I/O time; 6.2 GB over 10 Gbit Ethernet alone takes roughly 5 seconds.
There are only 94 printable ASCII characters, so for 6-character keys drawn from the full printable-ASCII set:
94^6 = 689,869,781,056 keys, which at 4-8 hashes/nanosecond is 86-172 seconds.
Oh my! :-(
The moral: use long keys, and something better than MD5!
Maybe try writing a Python program to generate the optimal GPU algorithm?
Generate the text of the GPU 'kernel' by unrolling the loops within the Python program, and print the text of the straight-line calculation with all the constants filled in (a toy version is sketched below).
Then try to figure out the optimal sequence of instructions to calculate the MD5 for each key length. Using the unrolled program, try to track the operations on each bit, and the dependencies, then try to reassemble the bits and their operations into contiguous 32-bit words and a new straight-line calculation. (To be fair, maybe the GPU compiler can do some of this anyway? It might be interesting to find out.)
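Here is a toy sketch of that generator idea: emit straight-line CUDA-style text for the 64 MD5 steps, with g and the f expression filled in as constants (K, R, and LEFTROTATE are assumed to be defined in the surrounding kernel, per the MD5 spec):

def emit_unrolled_md5_steps():
    """Generate straight-line C-style code for the 64 MD5 steps."""
    lines = []
    for i in range(64):
        if i < 16:
            f, g = "(b & c) | (~b & d)", i
        elif i < 32:
            f, g = "(d & b) | (~d & c)", (5 * i + 1) % 16
        elif i < 48:
            f, g = "b ^ c ^ d", (3 * i + 5) % 16
        else:
            f, g = "c ^ (b | ~d)", (7 * i) % 16
        lines.append(f"/* i == {i} */")
        lines.append(f"f = {f};")
        lines.append("temp = d; d = c; c = b;")
        lines.append(f"b = b + LEFTROTATE(a + f + K[{i}] + w[{g}], R[{i}]);")
        lines.append("a = temp;")
    return "\n".join(lines)

print(emit_unrolled_md5_steps())  # paste the result into the kernel source

Because g is now a literal, every w[g] with g not in {0, 1} is a compile-time constant for a fixed key length, which is exactly what the compiler needs for the constant propagation discussed above.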