We have a bunch of CRC32 hashes whose inputs it would be really nice to know. Some of them are short enough that brute-forcing them is feasible; others are not. The CRC32 algorithm being used matches the one in Python's binascii
(its implementation is spelled out at https://rosettacode.org/wiki/CRC-32#Python).
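For reference, here is a quick sanity check of exactly which variant I mean: a table-driven CRC-32 along the lines of the one on that page (reflected, polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF), checked against binascii:

```python
import binascii

# Table-driven CRC-32 (reflected, poly 0xEDB88320), in the style of the Rosetta Code page
TABLE = []
for i in range(256):
    c = i
    for _ in range(8):
        c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
    TABLE.append(c)

def crc32(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

assert crc32(b"turn_around") == binascii.crc32(b"turn_around")
```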
I've read these pages:
...and it seems to me that there's something hidden somewhere that can reduce the amount of effort needed to reverse these hashes, which I just can't puzzle out.
The main reason I think we can do better than full brute force is that we know two things about the hash inputs. First, they only contain characters from [A-Za-z_0-9] (letters, numbers, and underscores). Second, numbers are rare, and A-Z do not appear to ever be mixed with a-z, so in 90% of cases we can get away with just [a-z_].
Since the above links discuss how you can add any 4 chars to an input and leave the hash unchanged, one idea I had is to brute-force the last 4 chars (most candidates are invalid because they fall outside the restricted charset), filter out the ones that are clearly not right (because of illegal 3-letter English combinations and such), and come up with a shortlist of potentially valid last-4-char blocks. Then we repeat until the whole thing has been whittled down, which should be faster than pure brute force. But I can't make the leap in logic to figure out how.
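To make that "append 4 chars without changing the hash" property concrete, here is a rough sketch (assuming Python 3 and the standard reflected CRC-32 that binascii uses) of a helper, which I'll call forge_suffix, that computes a 4-byte suffix forcing the CRC to any chosen value; keeping the original CRC is just the special case where the target is the prefix's own CRC. Note the bytes it produces are arbitrary, so in our setting they would still have to be restricted or filtered to the allowed charset:

```python
import binascii

POLY = 0xEDB88320
TABLE = []
for i in range(256):
    c = i
    for _ in range(8):
        c = (c >> 1) ^ POLY if c & 1 else c >> 1
    TABLE.append(c)

def forge_suffix(prefix: bytes, target_crc: int) -> bytes:
    """Return 4 bytes s such that binascii.crc32(prefix + s) == target_crc."""
    reg = binascii.crc32(prefix) ^ 0xFFFFFFFF   # internal register after prefix
    want = target_crc ^ 0xFFFFFFFF              # desired final register

    # Walk backwards: the top byte of the register after each step pins down
    # the table index used in that step (the table's top bytes are all distinct).
    idx = [0] * 4
    cur = want
    for i in range(3, -1, -1):
        k = next(j for j in range(256) if TABLE[j] >> 24 == cur >> 24)
        idx[i] = k
        cur = ((cur ^ TABLE[k]) << 8) & 0xFFFFFFFF

    # Walk forwards: turn the indices into actual suffix bytes.
    out = bytearray()
    for i in range(4):
        out.append((reg ^ idx[i]) & 0xFF)
        reg = (reg >> 8) ^ TABLE[idx[i]]
    return bytes(out)

# Appending these 4 bytes leaves the CRC of "turn" unchanged:
suffix = forge_suffix(b"turn", binascii.crc32(b"turn"))
assert binascii.crc32(b"turn" + suffix) == binascii.crc32(b"turn")
```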
If I'm going off in the wrong direction to solve this, that would be good to know as well.
the majority of these are snake_case English words/phrases (e.g. "weight" or "turn_around")
- These could be brute-forced using a dictionary (e.g. from this question) and standard utilities. Assuming the total number of English words is up to 1M, trying (1M)^2 = 10^12 CRC32 values looks feasible and quite fast.
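As a rough feasibility sketch in plain Python (assuming a hypothetical words.txt with one lowercase word per line and a hashes.txt with one 8-hex-digit CRC per line; the Hashcat runs below are far faster in practice), the single-word and word_word cases look like this:

```python
import binascii

words = [w.strip() for w in open("words.txt") if w.strip()]
targets = {int(h.strip(), 16) for h in open("hashes.txt") if h.strip()}

# Single words
for w in words:
    if binascii.crc32(w.encode()) in targets:
        print("match:", w)

# word_word pairs: ~1M words -> ~10^12 CRCs; slow in pure Python but
# embarrassingly parallel, and hashcat does the same enumeration natively.
for a in words:
    prefix = a + "_"
    for b in words:
        if binascii.crc32((prefix + b).encode()) in targets:
            print("match:", prefix + b)
```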
Given a text file with all dictionary words, enumerating every word_word combination and comparing against the CRC hashes could be done with e.g. the Hashcat tool, following the instructions from here, as:
hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt dictionary.txt dictionary.txt
and just testing each word in the dictionary as:
hashcat64.bin -m 11500 -a 0 hashes.txt dictionary.txt
For phrases longer than 2 words, each phrase length has to be treated as a separate case, since e.g. Hashcat has no option to combine 3 or more dictionaries (ref). For 3-word phrases you first need to generate a file with all 2-word combinations (e.g. as here, but in the form {}_{}); see the sketch after this paragraph. Then combine it with the 1-word dictionary:
hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt two_words_dictionary.txt dictionary.txt
Next, 4-word phrases could be brute-forced as:
hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt two_words_dictionary.txt two_words_dictionary.txt
and so on. (Another option would be to pipe combinations into Hashcat as combine_script ... | hashcat64.bin -m 11500 -a 0 hashes.txt, but since CRC32 is very fast to check, the pipe would be the bottleneck here; using dictionary files is much faster than piping.)
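A minimal sketch of that intermediate step (assuming dictionary.txt has one word per line; note the output grows to roughly |dictionary|^2 lines):

```python
# Generate every "word_word" combination for the 3-word / 4-word hashcat runs.
# Assumes dictionary.txt has one word per line; writes two_words_dictionary.txt.
with open("dictionary.txt") as f:
    words = [w.strip() for w in f if w.strip()]

with open("two_words_dictionary.txt", "w") as out:
    for a in words:
        for b in words:
            out.write(f"{a}_{b}\n")
```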
Of course, the number of n-word combinations grows exponentially in n (with a huge base). But since the dictionary is limited to some subset rather than all English words, how deep it is practical to go with brute force depends on the size of that dictionary.
Let me take another direction on the question and talk about the nature of CRC. As you know, a Cyclic Redundancy Check is calculated by dividing the message (viewed as a polynomial over GF(2)) by a fixed generator polynomial, and it is by nature linear (a concept borrowed from coding theory). Let me explain what I mean by linear. Assume we have three messages A, B, C of equal length. Then if
CRC(A) = CRC(B)
we can say
CRC(A^C) = CRC(B^C)
(meaning that the CRC changes predictably under XOR). Note that CRC is not a cryptographic hash and its behaviour can be predicted, so you don't need a complicated tool like Hashcat even when the search space is large.
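A small demonstration of this with binascii, for equal-length inputs (the strings here are just arbitrary examples):

```python
import binascii

def xor_bytes(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

a, b, c = b"weight", b"wright", b"abc_ef"   # arbitrary equal-length examples

# Because of the initial value and final XOR, CRC32 is affine rather than
# strictly linear: crc(x ^ y) == crc(x) ^ crc(y) ^ crc(zeros of same length).
zeros = b"\x00" * len(a)
assert binascii.crc32(xor_bytes(a, b)) == (
    binascii.crc32(a) ^ binascii.crc32(b) ^ binascii.crc32(zeros)
)

# Consequence: if crc(A) == crc(B), then crc(A ^ C) == crc(B ^ C).
```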
So in theory you can describe the whole solution set of CRC(x) = b, for a fixed message length, as
x = b0 + nullspace.
Here b0 is any particular string that satisfies
CRC(b0) = expectedCRC
and the nullspace is that of the associated linear map. (Another note: because these implementations usually include an initial value and a final XOR, CRC(0) != 0, i.e. the map is affine rather than strictly linear.)
Then you can reorder the nullspace basis to be localized. And then, knowing that your space contains only ASCII characters and constrained character sequences, you can search it efficiently.
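Here is a rough sketch of that idea in plain Python (no dependencies, function names are my own): for a fixed message length n, CRC32 is an affine map over GF(2), so you can build the 32 x 8n system explicitly, solve for one particular preimage b0 and a nullspace basis, and then search b0 XOR (combinations of basis vectors) under the charset constraints. For 4-byte messages the system is full rank, so the preimage is unique:

```python
import binascii

def affine_model(n):
    """For n-byte messages, crc32(x) = const ^ XOR of cols[i] over the set bits i of x."""
    const = binascii.crc32(bytes(n))            # CRC of the all-zero message
    cols = []
    for i in range(8 * n):
        e = bytearray(n)
        e[i // 8] |= 0x80 >> (i % 8)            # message with only bit i set
        cols.append(binascii.crc32(bytes(e)) ^ const)
    return cols, const

def preimages(n, target):
    """Return (b0, basis): every n-byte preimage of `target` is b0 XOR any
    subset-XOR of basis vectors (all encoded as bitmasks over the 8n message bits)."""
    cols, const = affine_model(n)
    m = 8 * n
    # 32 GF(2) equations in m unknowns: row j collects bit j of every column.
    rows = []
    for j in range(32):
        mask = sum(1 << i for i in range(m) if (cols[i] >> j) & 1)
        rows.append([mask, ((target ^ const) >> j) & 1])
    # Gauss-Jordan elimination over GF(2).
    pivots, r = [], 0
    for col in range(m):
        piv = next((k for k in range(r, len(rows)) if (rows[k][0] >> col) & 1), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for k in range(len(rows)):
            if k != r and (rows[k][0] >> col) & 1:
                rows[k][0] ^= rows[r][0]
                rows[k][1] ^= rows[r][1]
        pivots.append(col)
        r += 1
    # Particular solution: pivot variables take the RHS, free variables are 0.
    # (No consistency check: for n >= 4 every 32-bit target is reachable.)
    b0 = sum(1 << col for k, col in enumerate(pivots) if rows[k][1])
    # Nullspace basis: one vector per free variable.
    basis = []
    for f in (c for c in range(m) if c not in pivots):
        v = 1 << f
        for k, col in enumerate(pivots):
            if (rows[k][0] >> f) & 1:
                v |= 1 << col
        basis.append(v)
    return b0, basis

def to_bytes(x, n):
    out = bytearray(n)
    for i in range(8 * n):
        if (x >> i) & 1:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)

# 4-byte example: full rank, so the preimage is unique and exactly recovers the input.
target = binascii.crc32(b"turn")
b0, basis = preimages(4, target)
assert basis == [] and to_bytes(b0, 4) == b"turn"
```

For longer messages the nullspace has dimension 8n - 32, so enumerating it blindly is no better than brute force; it is the charset and sequence constraints that do the pruning, as described above.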
Knowing that the output space is only pow(2,32) while your possible inputs number around pow(10,12), I would say there are many texts that map to the same CRC: on average roughly 10^12 / 2^32 ≈ 230 candidate inputs per hash.