Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can you explain theese disturbing anomalies of md5 and modulo?

Okay, the title is really very subjective. But thats just what the problem is to me.

The background is that I want to distribute hits of static web contents evenly about a defined number of caching servers. Also the delivery to clients should speed up because several domains are in use and requests are not blocking each other. I also don't need a classic load balancer but generate the right links right away in my html code.

I also want to ensure that the same url always gets served by the same server.

So I just defined a little function that returns the host to use by hashing the request url and calculates the modulo by the number of servers in use:

function pseudocode_statify($url) { // $url looks like /folder1/folder2/file.jpg
 return 'http://' . md5($url) % $num_of_servers .'.mydomain.com' . $url;
}

I first had something like hex decoding and substring to prevent overflow in place, but found out that it just works fine the way above.

However my problem is that if I run the following test script:

for($i=0;$i<100000;$i++) {
  $md5 = md5(uniqid($i).microtime().rand(1,999999999999));
  $result[$md5%2]++;
}

I expected an even distribution. meaning that $result[0] would be near the value of $result[1];

This was not the case.

Okay, this is nothing special sofar. I would have just accepted the fact that md5 is not as evenly distributed as i thought and would have gone vor another hashing algorithm like sha1 or something.

But I tried to reproduce the findings and found a pattern that I cannot explain.

The ratio was always about 2/1. In fact it was the ratio was always something like 1/2.16 to 1/2.17

Sample output of some runs of the script above:

output was generated by: echo "ratio: ".$result[0]/$result[1]."\n";

ratio: 2.1757121534504
ratio: 2.1729411578062
ratio: 2.1726559360393
ratio: 2.1676895664225
ratio: 2.1667416128848
ratio: 2.1667115284133
ratio: 2.1677791605385
ratio: 2.1658969579688
ratio: 2.1668508131769
ratio: 2.1689292821741

Now the weird thing was that the ratio of sums % 2 equaling 1 and sums % 2 equaling 0 sometimes alternated!

for($j = 0; $j<100;$j++) {
    for($i=0;$i<100000;$i++) {
      $md5 = md5(uniqid($i).microtime().rand(1,999999999999));
      $result[$md5%2]++;
    }
var_dump($result);
}

I ran the script from the command line two sperate times and aborted it after 3 runs and it produced theese two outputs:

joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice:  Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
  [0]=>
  int(68223)
  [1]=>
  int(31777)
}
array(2) {
  [0]=>
  int(136384)
  [1]=>
  int(63616)
}
array(2) {
  [0]=>
  int(204498)
  [1]=>
  int(95502)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice:  Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice:  Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
  [1]=>
  int(31612)
  [0]=>
  int(68388)
}
array(2) {
  [1]=>
  int(63318)
  [0]=>
  int(136682)
}
array(2) {
  [1]=>
  int(94954)
  [0]=>
  int(205046)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$ 

As you can see in the first one the first entry of results is always higher, in the second one its the other way round. same script.

Strange thing is that i can ONLY reproduce this behaviour when i run the script several times.

I wrote this small script to reproduce the "swapping" and generate enough measure data:

for($j = 0; $j<100;$j++) {
  for($i=0;$i<rand(1000,10000);$i++) {
    $md5 = md5(uniqid($i).microtime().rand(1,99999999));
    $result[$md5%2]++;
    }
    #var_dump($result);
    echo "ratio: ".$result[0]/$result[1]." ".(($result[0]<$result[1]) ? "A":"B")."\n";
    sleep(rand(2,5));
}

But here It only prints b, never A's. Which made me think there might be a semantic error in the script, but i didnt find any.

I am really stuck and this bothers me a lot.

So my questions:

  • Can you recommend any literature / weblinks were i could read about md5 a little bit deeper including distributions etc

  • Can you explain / reproduce the behaviour? Do I have an error here? (in fact thats very likely but i cant find it)

  • Can you recommend any other algorithm that would besuitable for my use case? It needs not be cryptographic or strong but fast, deterministic and evenly distributed.

like image 401
The Surrican Avatar asked Mar 29 '11 17:03

The Surrican


2 Answers

The md5() function returns a string, not an integer.

Which means that this string will be type-casted to an integer to do the modulo ; and as this string will contain characters in the 0-9A-F range, casted to an integer, you have :

  • 1 chance out of 16 of getting a 0
  • 9 chances out of 16 of getting between 1 and 9
  • 6 chances out of 16 of getting between A and F -- which will be casted to a 0


For example, this :

$a = md5('plop1');
var_dump($a, (int)$a);

$a = md5('plop2');
var_dump($a, (int)$a);

$a = md5('plop5');
var_dump($a, (int)$a);

Will get you the following output :

string 'ac4bf0e466417336599b72a8b2f595da' (length=32)
int 0

string 'ed91c463402dd797d0718350f5bd0acd' (length=32)
int 0

string '85782b3afb04072c1bf172a6a7e6bb5e' (length=32)
int 85782

I'll let you guess the possible impact this can have on the result of the modulo operator ;-)

like image 107
Pascal MARTIN Avatar answered Nov 09 '22 00:11

Pascal MARTIN


for ($i = 0; $i < 10; $i++) {
    srand(crc32('test_url1'));
    echo rand().'<br />';
}
for ($i = 0; $i < 10; $i++) {
    srand(crc32('test_url2'));
    echo rand().'<br />';
}

add a range to the rand function and you have your server value.

like image 21
dqhendricks Avatar answered Nov 08 '22 23:11

dqhendricks