We have a bunch of CRC32 hashes whose inputs it would be really nice to know. Some of them are short enough that brute-forcing them is feasible; others are not. The CRC32 algorithm being used matches the one in Python's binascii
(its implementation is spelled out at https://rosettacode.org/wiki/CRC-32#Python).
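For reference, here is a quick sanity check of exactly which variant I mean: a table-driven CRC-32 along the lines of the one on that page (reflected, polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF), checked against binascii:

```python
import binascii

# Table-driven CRC-32 (reflected, poly 0xEDB88320), in the style of the Rosetta Code page
TABLE = []
for i in range(256):
    c = i
    for _ in range(8):
        c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
    TABLE.append(c)

def crc32(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

assert crc32(b"turn_around") == binascii.crc32(b"turn_around")
```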
I've read these pages:
...and it seems to me that there's something hidden somewhere that can reduce the amount of effort needed to reverse these hashes, which I just can't puzzle out.
The main reason I think we can do better than full brute force is that we know two things about the hash inputs. First, they only contain characters from [A-Za-z_0-9] (letters, numbers, and underscores). Second, numbers are rare, and A-Z do not appear to ever be mixed with a-z, so in 90% of cases we can get away with just [a-z_].
Since the above links discuss how you can add any 4 chars to an input and leave the hash unchanged, one idea I had is to brute-force the last 4 chars (most candidates are invalid because they fall outside the restricted charset), filter out the ones that are clearly not right (because of illegal 3-letter English combinations and such), and come up with a shortlist of potentially valid last-4-char blocks. Then we repeat until the whole thing has been whittled down, which should be faster than pure brute force. But I can't make the leap in logic to figure out how.
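To make that "append 4 chars without changing the hash" property concrete, here is a rough sketch (assuming Python 3 and the standard reflected CRC-32 that binascii uses) of a helper, which I'll call forge_suffix, that computes a 4-byte suffix forcing the CRC to any chosen value; keeping the original CRC is just the special case where the target is the prefix's own CRC. Note the bytes it produces are arbitrary, so in our setting they would still have to be restricted or filtered to the allowed charset:

```python
import binascii

POLY = 0xEDB88320
TABLE = []
for i in range(256):
    c = i
    for _ in range(8):
        c = (c >> 1) ^ POLY if c & 1 else c >> 1
    TABLE.append(c)

def forge_suffix(prefix: bytes, target_crc: int) -> bytes:
    """Return 4 bytes s such that binascii.crc32(prefix + s) == target_crc."""
    reg = binascii.crc32(prefix) ^ 0xFFFFFFFF   # internal register after prefix
    want = target_crc ^ 0xFFFFFFFF              # desired final register

    # Walk backwards: the top byte of the register after each step pins down
    # the table index used in that step (the table's top bytes are all distinct).
    idx = [0] * 4
    cur = want
    for i in range(3, -1, -1):
        k = next(j for j in range(256) if TABLE[j] >> 24 == cur >> 24)
        idx[i] = k
        cur = ((cur ^ TABLE[k]) << 8) & 0xFFFFFFFF

    # Walk forwards: turn the indices into actual suffix bytes.
    out = bytearray()
    for i in range(4):
        out.append((reg ^ idx[i]) & 0xFF)
        reg = (reg >> 8) ^ TABLE[idx[i]]
    return bytes(out)

# Appending these 4 bytes leaves the CRC of "turn" unchanged:
suffix = forge_suffix(b"turn", binascii.crc32(b"turn"))
assert binascii.crc32(b"turn" + suffix) == binascii.crc32(b"turn")
```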
If I'm going off in the wrong direction to solve this, that would be good to know as well.
the majority of these are snake_case English words/phrases (e.g. "weight" or "turn_around")
- These could be brute-forced using a dictionary (e.g. from this question) and standard utilities. Assuming the total number of English words is up to 1M, trying (1M)^2 = 10^12 CRC32 values looks feasible and quite fast.
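As a rough feasibility sketch in plain Python (assuming a hypothetical words.txt with one lowercase word per line and a hashes.txt with one 8-hex-digit CRC per line; the Hashcat runs below are far faster in practice), the single-word and word_word cases look like this:

```python
import binascii

words = [w.strip() for w in open("words.txt") if w.strip()]
targets = {int(h.strip(), 16) for h in open("hashes.txt") if h.strip()}

# Single words
for w in words:
    if binascii.crc32(w.encode()) in targets:
        print("match:", w)

# word_word pairs: ~1M words -> ~10^12 CRCs; slow in pure Python but
# embarrassingly parallel, and hashcat does the same enumeration natively.
for a in words:
    prefix = a + "_"
    for b in words:
        if binascii.crc32((prefix + b).encode()) in targets:
            print("match:", prefix + b)
```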
Given a text file with all dictionary words, enumerating every word_word combination and comparing against the CRC hashes could be done with e.g. the Hashcat tool, following the instructions from here, as:
hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt dictionary.txt dictionary.txt
and just testing each word in the dictionary as:
hashcat64.bin -m 11500 -a 0 hashes.txt dictionary.txt
For phrases longer than 2 words, each phrase length has to be treated as a separate case, since e.g. Hashcat has no option to combine 3 or more dictionaries (ref). For 3-word phrases you first need to generate a file with all 2-word combinations (e.g. as here, but in the form {}_{}); see the sketch after this paragraph. Then combine it with the 1-word dictionary:
hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt two_words_dictionary.txt dictionary.txt
Next, 4-word phrases could be brute-forced as:
hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt two_words_dictionary.txt two_words_dictionary.txt
and so on. (Another option would be to pipe combinations into Hashcat as combine_script ... | hashcat64.bin -m 11500 -a 0 hashes.txt, but since CRC32 is very fast to check, the pipe would be the bottleneck here; using dictionary files is much faster than piping.)
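A minimal sketch of that intermediate step (assuming dictionary.txt has one word per line; note the output grows to roughly |dictionary|^2 lines):

```python
# Generate every "word_word" combination for the 3-word / 4-word hashcat runs.
# Assumes dictionary.txt has one word per line; writes two_words_dictionary.txt.
with open("dictionary.txt") as f:
    words = [w.strip() for w in f if w.strip()]

with open("two_words_dictionary.txt", "w") as out:
    for a in words:
        for b in words:
            out.write(f"{a}_{b}\n")
```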
Of course, the number of n-word combinations grows exponentially in n (with a huge base). But since the dictionary is limited to some subset rather than all English words, how deep it is practical to go with brute force depends on the size of that dictionary.
Let me take another direction on the question and talk about the nature of CRC. As you know, a Cyclic Redundancy Check is calculated by dividing the message (viewed as a polynomial over GF(2)) by a fixed generator polynomial, and it is by nature linear (a concept borrowed from coding theory). Let me explain what I mean by linear. Assume we have three messages A, B, C of equal length. Then if
CRC(A) = CRC(B)
we can say
CRC(A^C) = CRC(B^C)
(meaning that the CRC changes predictably under XOR). Note that CRC is not a cryptographic hash and its behaviour can be predicted, so you don't need a complicated tool like Hashcat even when the search space is large.
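A small demonstration of this with binascii, for equal-length inputs (the strings here are just arbitrary examples):

```python
import binascii

def xor_bytes(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

a, b, c = b"weight", b"wright", b"abc_ef"   # arbitrary equal-length examples

# Because of the initial value and final XOR, CRC32 is affine rather than
# strictly linear: crc(x ^ y) == crc(x) ^ crc(y) ^ crc(zeros of same length).
zeros = b"\x00" * len(a)
assert binascii.crc32(xor_bytes(a, b)) == (
    binascii.crc32(a) ^ binascii.crc32(b) ^ binascii.crc32(zeros)
)

# Consequence: if crc(A) == crc(B), then crc(A ^ C) == crc(B ^ C).
```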
So in theory you can describe the whole solution set of CRC(x) = b, for a fixed message length, as
x = b0 + nullspace.
Here b0 is any particular string that satisfies
CRC(b0) = expectedCRC
and the nullspace is that of the associated linear map. (Another note: because these implementations usually include an initial value and a final XOR, CRC(0) != 0, i.e. the map is affine rather than strictly linear.)
Then you can reorder the nullspace basis to be localized. And then, knowing that your space contains only ASCII characters and constrained character sequences, you can search it efficiently.
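Here is a rough sketch of that idea in plain Python (no dependencies, function names are my own): for a fixed message length n, CRC32 is an affine map over GF(2), so you can build the 32 x 8n system explicitly, solve for one particular preimage b0 and a nullspace basis, and then search b0 XOR (combinations of basis vectors) under the charset constraints. For 4-byte messages the system is full rank, so the preimage is unique:

```python
import binascii

def affine_model(n):
    """For n-byte messages, crc32(x) = const ^ XOR of cols[i] over the set bits i of x."""
    const = binascii.crc32(bytes(n))            # CRC of the all-zero message
    cols = []
    for i in range(8 * n):
        e = bytearray(n)
        e[i // 8] |= 0x80 >> (i % 8)            # message with only bit i set
        cols.append(binascii.crc32(bytes(e)) ^ const)
    return cols, const

def preimages(n, target):
    """Return (b0, basis): every n-byte preimage of `target` is b0 XOR any
    subset-XOR of basis vectors (all encoded as bitmasks over the 8n message bits)."""
    cols, const = affine_model(n)
    m = 8 * n
    # 32 GF(2) equations in m unknowns: row j collects bit j of every column.
    rows = []
    for j in range(32):
        mask = sum(1 << i for i in range(m) if (cols[i] >> j) & 1)
        rows.append([mask, ((target ^ const) >> j) & 1])
    # Gauss-Jordan elimination over GF(2).
    pivots, r = [], 0
    for col in range(m):
        piv = next((k for k in range(r, len(rows)) if (rows[k][0] >> col) & 1), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for k in range(len(rows)):
            if k != r and (rows[k][0] >> col) & 1:
                rows[k][0] ^= rows[r][0]
                rows[k][1] ^= rows[r][1]
        pivots.append(col)
        r += 1
    # Particular solution: pivot variables take the RHS, free variables are 0.
    # (No consistency check: for n >= 4 every 32-bit target is reachable.)
    b0 = sum(1 << col for k, col in enumerate(pivots) if rows[k][1])
    # Nullspace basis: one vector per free variable.
    basis = []
    for f in (c for c in range(m) if c not in pivots):
        v = 1 << f
        for k, col in enumerate(pivots):
            if (rows[k][0] >> f) & 1:
                v |= 1 << col
        basis.append(v)
    return b0, basis

def to_bytes(x, n):
    out = bytearray(n)
    for i in range(8 * n):
        if (x >> i) & 1:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)

# 4-byte example: full rank, so the preimage is unique and exactly recovers the input.
target = binascii.crc32(b"turn")
b0, basis = preimages(4, target)
assert basis == [] and to_bytes(b0, 4) == b"turn"
```

For longer messages the nullspace has dimension 8n - 32, so enumerating it blindly is no better than brute force; it is the charset and sequence constraints that do the pruning, as described above.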
Knowing that the output space is only pow(2,32) while your possible inputs number around pow(10,12), I would say there are many texts that map to the same CRC: on average roughly 10^12 / 2^32 ≈ 230 candidate inputs per hash.