Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reverse Jenkins' one-at-a-time hash

How would I go about obtaining any possible string value that matches a returned hash?

I don't want to obtain the exact key that was used, just any key that when passed into the function, will return the same hash of the unknown key.

uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) {
      size_t i = 0;
      uint32_t hash = 0;
      while (i != length) {
        hash += key[i++];
        hash += hash << 10;
        hash ^= hash >> 6;
      }
      hash += hash << 3;
      hash ^= hash >> 11;
      hash += hash << 15;
      return hash;
    }

E.g. I pass key as "keynumber1", the function returns 0xA7AF2FFE. How would I find ANY string that can also be hashed into 0xA7AF2FFE.

like image 374
Joseph Jones Avatar asked Dec 18 '22 19:12

Joseph Jones


1 Answers

While the brute force method suggested by chux works fine as it is, we can in fact speed it up by a factor of up to 256 or so (and, in fact, a lot more if we use all the optimizations described below).

The key realization here is that all the operations used to compute the hash are reversible. (This is by design, since it ensures that e.g. appending the same suffix to all input strings won't increase the number of hash collisions.) Specifically:

  • The operation hash += hash << n is, of course, equivalent to hash *= (1 << n) + 1. We're working with 32-bit unsigned integers, so all these calculations are done modulo 232. To undo this operation, all we need to do is find the modular multiplicative inverse of (1 << n) + 1 = 2n + 1 modulo 232 and multiply hash by it.

    We can do this pretty easily e.g. with this Python script, based on this answer here on SO. As it turns out, the multiplicative inverses of 210 + 1, 23 + 1 and 215 + 1 are, in hex, 0xC00FFC01, 0x38E38E39 and 0x3FFF8001 respectively.

  • To find the inverse of hash ^= hash >> n for some constant n, first note that this operation leaves the highest n bits of hash entirely unchanged. The next lower n bits are simply XORed with the highest n bits, so for those, simply repeating the operation undoes it. Looks pretty simple so far, right?

    To recover the original values of the third highest group of n bits, we need to XOR them with the original values of the second highest n bits, which we can of course calculate by XORing the two highest groups of n bits as describe above. And so on.

    What this all boils down to is that the inverse operation to hash ^= hash >> n is:

    hash ^= (hash >> n) ^ (hash >> 2*n) ^ (hash >> 3*n) ^ (hash >> 4*n) ^ ...
    

    where, of course, we can cut off the series once the shift amount is equal or greater to the number of bits in the integers we're working with (i.e. 32, in this case). Alternatively, we could achieve the same result in multiple steps, doubling the shift amount each time until it exceeds the bitlength of the numbers we're working with, like this:

    hash ^= hash >> n;
    hash ^= hash >> 2*n;
    hash ^= hash >> 4*n;
    hash ^= hash >> 8*n;
    // etc.
    

    (The multiple step method scales better when n is small compared to the integer size, but for moderately large n, the single step method may suffer from fewer pipeline stalls on modern CPUs. It's hard to say which one is actually more efficient in any given situation without benchmarking them both, and the results may vary between compilers and CPU models. In any case, such micro-optimizations are mostly not worth worrying too much about.)

  • Finally, of course, the inverse of hash += key[i++] is simply hash -= key[--i].

All this means that, if we want to, we can run the hash in reverse like this:

uint32_t reverse_one_at_a_time_hash(const uint8_t* key, size_t length, uint32_t hash) {
  hash *= 0x3FFF8001;  // inverse of hash += hash << 15;
  hash ^= (hash >> 11) ^ (hash >> 22);
  hash *= 0x38E38E39;  // inverse of hash += hash << 3;
  size_t i = length;
  while (i > 0) {
    hash ^= (hash >> 6) ^ (hash >> 12) ^ (hash >> 18) ^ (hash >> 24) ^ (hash >> 30);
    hash *= 0xC00FFC01;  // inverse of hash += hash << 10;
    hash -= key[--i];
  }
  return hash;  // this should return 0 if the original hash was correct
}

Then calling, say, reverse_one_at_a_time_hash("keynumber1", 10, 0xA7AF2FFE) should return zero, as indeed it does.


OK, that's cool. But what good is this for finding preimages?

Well, for one thing, if we guess all but the first byte of the input, then we can set the first byte to zero and run the hash backwards over this input. At this point, there are two possible outcomes:

  1. If the running the hash backwards like this produces an output that is a valid input byte (i.e. no greater than 255, and possibly with other restrictions if you e.g. want all the input bytes to be printable ASCII), then we can set the first byte of the input to that value, and we're done!

  2. Conversely, if the result of running the hash backwards is not a valid input byte (e.g. if it's greater than 255), then we know that there's no first byte that could make the rest of the input hash to the output we want, and we'll need to try another guess instead.

Here's an example, which finds the same input as chux's code (but prints it as a quoted string, not as a little-endian int):

#define TARGET_HASH 0xA7AF2FFE
#define INPUT_LEN 4

int main() {
  uint8_t buf[INPUT_LEN+1];  // buffer for guessed input (and one more null byte at the end)
  for (int i = 0; i <= INPUT_LEN; i++) buf[i] = 0;

  do {
    uint32_t ch = reverse_one_at_a_time_hash(buf, INPUT_LEN, TARGET_HASH);

    if (ch <= 255) {
      buf[0] = ch;
      // print the input with unprintable chars nicely quoted
      printf("hash(\"");
      for (int i = 0; i < INPUT_LEN; i++) {
        if (buf[i] < 32 || buf[i] > 126 || buf[i] == '"' || buf[i] == '\\') printf("\\x%02X", buf[i]);
        else putchar(buf[i]);
      }
      printf("\") = 0x%08X\n", TARGET_HASH);
      return 0;
    }

    // increment buffer, starting from second byte
    for (int i = 1; ++buf[i] == 0; i++) /* nothing */;
  } while (buf[INPUT_LEN] == 0);

  printf("No matching input of %d bytes found for hash 0x%08X. :(", INPUT_LEN, TARGET_HASH);
  return 1;
}

And here's a version that restricts the input to printable ASCII (and outputs the five-byte string ^U_N.):

#define TARGET_HASH 0xA7AF2FFE
#define MIN_INPUT_CHAR ' '
#define MAX_INPUT_CHAR '~'
#define INPUT_LEN 5

int main() {
  uint8_t buf[INPUT_LEN+1];  // buffer for guessed input (and one more null byte at the end)
  buf[0] = buf[INPUT_LEN] = 0;
  for (int i = 1; i < INPUT_LEN; i++) buf[i] = MIN_INPUT_CHAR;

  do {
    uint32_t ch = reverse_one_at_a_time_hash(buf, INPUT_LEN, TARGET_HASH);
    if (ch >= MIN_INPUT_CHAR && ch <= MAX_INPUT_CHAR) {
      buf[0] = ch;
      printf("hash(\"%s\") = 0x%08X\n", buf, TARGET_HASH);
      return 0;
    }

    // increment buffer, starting from second byte, while keeping bytes within the valid range
    int i = 1;
    while (buf[i] >= MAX_INPUT_CHAR) buf[i++] = MIN_INPUT_CHAR;
    buf[i]++;
  } while (buf[INPUT_LEN] == 0);

  printf("No matching input of %d bytes found for hash 0x%08X. :(", INPUT_LEN, TARGET_HASH);
  return 1;
}

Of course, it's easy to modify this code to be even more restrictive about which input bytes to accept. For example, using the following settings:

#define TARGET_HASH 0xA7AF2FFE
#define MIN_INPUT_CHAR 'A'
#define MAX_INPUT_CHAR 'Z'
#define INPUT_LEN 7

produces (after a few seconds of computation) the preimage KQEJZVS.

Restricting the input range does make the code run slower, since the probability of the result of the backwards hash computation being a valid input byte is, of course, proportional to the number of possible valid bytes.


There are various ways in which this code could be made to run even faster. For example, we could combine the backwards hashing with a recursive search, so that we don't have to repeatedly hash the whole input string even if only one byte of it changes:

#define TARGET_HASH 0xA7AF2FFE
#define MIN_INPUT_CHAR 'A'
#define MAX_INPUT_CHAR 'Z'
#define INPUT_LEN 7

static bool find_preimage(uint32_t hash, uint8_t *buf, int depth) {
  // first invert the hash mixing step
  hash ^= (hash >> 6) ^ (hash >> 12) ^ (hash >> 18) ^ (hash >> 24) ^ (hash >> 30);
  hash *= 0xC00FFC01;  // inverse of hash += hash << 10;

  // then check if we're down to the first byte
  if (depth == 0) {
    bool found = (hash >= MIN_INPUT_CHAR && hash <= MAX_INPUT_CHAR);
    if (found) buf[0] = hash;
    return found;
  }

  // otherwise try all possible values for this byte
  for (uint32_t ch = MIN_INPUT_CHAR; ch <= MAX_INPUT_CHAR; ch++) {
    bool found = find_preimage(hash - ch, buf, depth - 1);
    if (found) { buf[depth] = ch; return true; }
  }
  return false;
}   

int main() {
  uint8_t buf[INPUT_LEN+1];  // buffer for results
  for (int i = 0; i <= INPUT_LEN; i++) buf[INPUT_LEN] = 0;

  // first undo the finalization step
  uint32_t hash = TARGET_HASH;
  hash *= 0x3FFF8001;  // inverse of hash += hash << 15;
  hash ^= (hash >> 11) ^ (hash >> 22);
  hash *= 0x38E38E39;  // inverse of hash += hash << 3;

  // then search recursively until we find a matching input
  bool found = find_preimage(hash, buf, INPUT_LEN - 1);
  if (found) {
    printf("hash(\"%s\") = 0x%08X\n", buf, TARGET_HASH);
  } else {
    printf("No matching input of %d bytes found for hash 0x%08X. :(", INPUT_LEN, TARGET_HASH);
  }
  return !found;
}

But wait, we're not done yet! Looking at the original code of the one-at-a-time hash, we can see that the value of hash after the first iteration of the loop will be ((c << 10) + c) ^ ((c << 4) + (c >> 6)), where c is the first byte of input. Since c is an eight-bit byte, this means that only the lowest 18 bytes of hash can be set after the first iteration.

If fact, if we calculate the value of hash after the first iteration for every possible value of the first byte c, we can see that hash never exceeds 1042 * c. (In fact, the maximum of the ratio hash / c is only 1041.015625 = 1041 + 2-6.) This means that, if M is the maximum possible value of a valid input byte, the value of hash after the first iteration cannot exceed 1042 * M. And adding in the next input byte only increases hash by at most M.

So we can speed up the code above significantly by adding the following shortcut check into find_preimage():

  // optimization: return early if no first two bytes can possibly match
  if (depth == 1 && hash > MAX_INPUT_CHAR * 1043) return false;

In fact, a similar argument can be used to show that, after processing the first two bytes, at most the lowest 28 bytes of hash can be set (and, more precisely, that the ratio of hash to the maximum input byte value is at most 1084744.46667). So we can extend the optimization above to cover the last three stages of the search by rewriting find_preimage() like this:

static bool find_preimage(uint32_t hash, uint8_t *buf, int depth) {
  // first invert the hash mixing step
  hash ^= (hash >> 6) ^ (hash >> 12) ^ (hash >> 18) ^ (hash >> 24) ^ (hash >> 30);
  hash *= 0xC00FFC01;  // inverse of hash += hash << 10;

  // for the lowest three levels, abort early if no solution is possible    
  switch (depth) {
    case 0:
      if (hash < MIN_INPUT_CHAR || hash > MAX_INPUT_CHAR) return false;
      buf[0] = hash;
      return true;
    case 1:
      if (hash > MAX_INPUT_CHAR * 1043) return false;
      else break;
    case 2:
      if (hash > MAX_INPUT_CHAR * 1084746) return false;
      else break;
  }

  // otherwise try all possible values for this byte
  for (uint32_t ch = MIN_INPUT_CHAR; ch <= MAX_INPUT_CHAR; ch++) {
    bool found = find_preimage(hash - ch, buf, depth - 1);
    if (found) { buf[depth] = ch; return true; }
  }
  return false;
}

For the example search for a seven byte all-uppercase preimage of the hash 0xA7AF2FFE, this further optimization cuts the running time down to just 0.075 seconds (as opposed to 0.148 seconds for the depth == 1 shortcut alone, 2.456 seconds for the recursive search with no shortcuts, and 15.489 seconds for the non-recursive search, as timed by TIO).

like image 80
Ilmari Karonen Avatar answered Jan 04 '23 22:01

Ilmari Karonen