
Is it possible to use the GPU to accelerate hashing in Python?

Tags: python, hash, gpu

I recently read Jeff's blog post entitled Speed Hashing, where amongst other things he mentions that you can hash things really fast by harnessing the power of your GPU.

I was wondering whether or not it was possible to harness the power of the GPU to hash things in Python (md5, sha-1 etc)?

I'm interested in this for trying to see how fast I can brute-force things (not real world stuff, from old leaked data dumps).

At the moment, I'm doing this sort of thing (simplified example):

from itertools import product
from hashlib import md5

hashes = ["some","hashes"]

chars = []
for i in range(97,123): # a-z only
    chars.append(chr(i))

for i in range(1,6): # all combos of a-z, 1-5 chars
    for c in product(chars,repeat=i):
        s = ''.join(c)
        if md5(s).hexdigest() in hashes:
            print "Found",s

But I was wondering whether there was a way to speed that up using the GPU? I'm guessing I'm going to need a module that would serially generate hashes like this - does anyone know of one?

Alex Coplan asked Apr 06 '12


1 Answer

There are two obstacles:

  1. Writing a program to execute on the GPU. AFAIK, there is no mechanism currently available to convert a Python program into code that executes on a GPU. So unless you can find what you need (which might be possible, as it looks like a reasonably common use case), you are going to have to write that part in one of the GPU programming languages (CUDA, OpenCL, Haskell, ...)
  2. Calling the program running on the GPU from Python, and exchanging data. There are several Python+CUDA projects which do that part:

    • http://mathema.tician.de/software/pycuda
    • http://code.google.com/p/pystream/
    • https://launchpad.net/python-cuda

    With suitable searching you might find more.

    Also Python GPU programming looks relevant

    Then the Python program will load and invoke the GPU 'kernel' (the program created using technology from part 1 of this answer) using one of the technologies from part 2, or equivalent.
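
To make part 2 concrete, here is a minimal PyCUDA sketch (assuming PyCUDA and the CUDA toolkit are installed; the kernel here just doubles some integers rather than computing MD5, and all names are my own illustration). It shows the general shape: compile a CUDA kernel from a string inside Python, push data to the GPU, run the kernel, and pull the results back:

import numpy as np
import pycuda.autoinit                    # creates a CUDA context on import
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# A trivial CUDA kernel; a real one would compute MD5 for one candidate key per thread.
mod = SourceModule("""
__global__ void double_them(int *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2;
}
""")

double_them = mod.get_function("double_them")

data = np.arange(1024, dtype=np.int32)
double_them(cuda.InOut(data), block=(256, 1, 1), grid=(4, 1))  # 4 blocks of 256 threads

print(data[:8])   # [ 0  2  4  6  8 10 12 14]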

EDIT You might generate the entire set of 'brute force' values and their MD5 hashes on the GPU, then just retrieve the results using Python. This might be easier than generating the values in Python, passing them to the GPU, and then getting back the MD5s.

If I've understood, the program generates all 1-, 2-, 3-, 4-, 5-, and 6-character lower-case-letter strings, and computes an MD5 hash for each, yes?
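
Assuming the values are generated on the GPU (as in the edit above), each thread can derive its own candidate key from its thread index, so nothing but the target hashes and any matches ever cross the PCIe bus. A sketch of that mapping, written in Python for clarity (the function name is mine; the same arithmetic would go inside the kernel):

# Treat the thread index as a base-26 number and turn it into a candidate key.
def index_to_key(index, length):
    letters = "abcdefghijklmnopqrstuvwxyz"
    key = []
    for _ in range(length):
        key.append(letters[index % 26])
        index //= 26
    return ''.join(key)

for idx in (0, 1, 26, 26 ** 6 - 1):
    print(index_to_key(idx, 6))   # aaaaaa, baaaaa, abaaaa, zzzzzz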


Edit2 - My previous analysis was utterly wrong - I apologise


Edit3: Skimming the Wikipedia MD5 article, it looks like calculating the MD5 of a constant-length string (e.g. 6 ASCII characters) can be optimised.

According to Wikipedia's pseudo-code it is only 64 loop iterations, with each group of 16 iterations using the same arithmetic. So, if the key is under 55 bytes, the core of the calculation could be 'unrolled' from:

for i from 0 to 63
    if 0 ≤ i ≤ 15 then
        f := (b and c) or ((not b) and d)
        g := i
    else if 16 ≤ i ≤ 31
        f := (d and b) or ((not d) and c)
        g := (5×i + 1) mod 16
    else if 32 ≤ i ≤ 47
        f := b xor c xor d
        g := (3×i + 5) mod 16
    else if 48 ≤ i ≤ 63
        f := c xor (b or (not d))
        g := (7×i) mod 16
    temp := d
    d := c
    c := b
    b := b + leftrotate((a + f + k[i] + w[g]) , r[i])
    a := temp
end for

to:

// i == 0
f := (b and c) or ((not b) and d)   // +4 ops
// g := i
temp := d
d := c
c := b
b := b + leftrotate((a + f + k[0] + w[0]) , r[0])  // 9 ops
a := temp
// i == 1
f := (b and c) or ((not b) and d)
// g := i
temp := d
d := c
c := b
b := b + leftrotate((a + f + k[1] + w[1]) , r[1])
a := temp

That unrolling causes some of the array indexing to be constant, which should allow a good GPU compiler to do even more constant propagation. This might give a significant improvement. Each step is approximately 9 operations and the compiler will need to shuffle 5 pieces of data, so roughly 14 operations/step * 64 steps, approximately 1000 operations.

Edit4:
Glerk! I've read more of the Wikipedia MD5 algorithm - MD5 is easier to attack than I'd realised. Only the first two loops of each group of 16 directly use the variable 6-byte key string; the rest of the message is constant. The rest of the algorithm is shuffling and bit-wise operations, which are likely amenable to very significant further optimisation. If only 2 of each 16 loops involve the key, then this might be up to 8x faster, and probably at least 4x.
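
That claim is easy to check from the g schedule in the pseudo-code above: for a key of up to 6 bytes, the key bytes all sit in message words w[0] and w[1] (the padding, the zero words and the length word are constants for a fixed key length), so only the steps where g is 0 or 1 touch variable data. A quick Python check of my own:

# Which of the 64 MD5 steps read a message word (w[0] or w[1]) that contains the key?
def g_index(i):
    if i < 16: return i
    if i < 32: return (5 * i + 1) % 16
    if i < 48: return (3 * i + 5) % 16
    return (7 * i) % 16

for start in range(0, 64, 16):
    steps = [i for i in range(start, start + 16) if g_index(i) in (0, 1)]
    print("steps %2d-%2d use the key at %s" % (start, start + 15, steps))

# steps  0-15 use the key at [0, 1]
# steps 16-31 use the key at [16, 19]
# steps 32-47 use the key at [36, 41]
# steps 48-63 use the key at [48, 55]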

So rather than a 1024-core GPU running at 1 GHz giving 1024 hashes/microsecond, say 4096/microsecond or 8192/microsecond, i.e. 4-8 hashes/nanosecond.

There are approximately 27^6 = 387,420,489 keys, and hence MD5 hashes.

387,420,489 keys / 4-8 hashes/nanosecond ≈ 0.05 - 0.1 seconds

Communication between the host and the GPU will be quite slow, but is unlikely to add more than 100% overhead.

So approximately between 0.1 seconds and 0.2 seconds.

An MD5 hash is 16 bytes, so storing them all would consume about 6.2 GBytes. On two modern GPUs, that would only require 2 transfers each, but would be a very significant overhead. If the hash values are saved to disk (even using SSD), or moved over 10Gbit Ethernet, the hash generation is swamped by the I/O time.
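
For reference, the same back-of-the-envelope arithmetic in Python (using the 4-8 hashes/nanosecond rate assumed above):

keys = 27 ** 6                              # candidate keys, as above
rate_low, rate_high = 4e9, 8e9              # 4-8 hashes/nanosecond
print("%d keys" % keys)                                              # 387420489 keys
print("%.2f - %.2f seconds" % (keys / rate_high, keys / rate_low))   # 0.05 - 0.10 seconds
print("%.1f GBytes of hashes" % (keys * 16 / 1e9))                   # 6.2 GBytes of hashes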

There are only 94 printable ASCII characters, so for every 6-character printable-ASCII key:

94^6 = 689,869,781,056 keys / 4-8 hashes/nanosecond ≈ 86 - 172 seconds

Oh my!-(

Long keys, and something better than MD5!

Maybe try to write a Python program to generate the optimal GPU algorithm?

Generate the text of the GPU 'kernel' by 'unrolling' the loops within the Python program, and print the text of the straight-line calculation, with all the constants filled in.

Then try to figure out what the optimal sequence of instructions is to calculate the MD5 for each key-length. Using the unrolled program, try to track the operations on each bit, and the dependencies, then try to reassemble the bits and their operations into contiguous 32-bit words and a new straight-line calculation. (To be fair, maybe the GPU compiler can do some of this anyway? Might be interesting to find out.)
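
A very rough sketch of that generator idea (my own illustration, not a finished tool): emit one straight-line block of kernel text per MD5 step, with the k[i] constant, the shift amount r[i] and the message-word index g already substituted in, ready to paste into the CUDA kernel source (LEFTROTATE is assumed to be defined there):

import math

# Shift amounts and sine-derived constants from the MD5 specification.
R = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
K = [int(abs(math.sin(i + 1)) * 2 ** 32) & 0xFFFFFFFF for i in range(64)]

def f_and_g(i):
    if i < 16: return "(b & c) | (~b & d)", i
    if i < 32: return "(d & b) | (~d & c)", (5 * i + 1) % 16
    if i < 48: return "b ^ c ^ d", (3 * i + 5) % 16
    return "c ^ (b | ~d)", (7 * i) % 16

lines = []
for i in range(64):
    f_expr, g = f_and_g(i)
    lines.append("f = %s;" % f_expr)
    lines.append("temp = d; d = c; c = b;")
    lines.append("b += LEFTROTATE(a + f + 0x%08xu + w[%d], %d);" % (K[i], g, R[i]))
    lines.append("a = temp;")

print("\n".join(lines))   # paste the output into the kernel between the init and the final additions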

gbulmer answered Sep 21 '22