 

Array collisions in PHP

Small remark

Reading about the max_input_vars setting led me to read a lot about PHP's internals for handling arrays. This is not really a question, but rather me answering my own question: "why do we really need max_input_vars?". The issue is not specific to PHP; it applies to a lot of other programming languages as well.

A problem:

Compare these two small php scripts:

$data = array();
for ($key = 0; $key <= 1073709056; $key += 32767){
    $data[$key] = 0;
}

You can check it here. Everything is normal, nothing unexpected: execution time is close to 0.

And this mostly identical script (the step differs by just 1):

$data = array();
for ($key = 0; $key <= 1073709056; $key += 32768){
    $data[$key] = 0;
}

Check it here. Nothing is normal, everything is unexpected: you exceed the maximum execution time. So it is at least 3000 times slower!

The question is why does it happen?

I posted it here together with an answer because investigating it vastly improved my knowledge of PHP internals and taught me new things about security.

Salvador Dali asked Apr 02 '14 23:04


2 Answers

The problem is not in the loop; the problem is in how PHP and many other languages (Java, Python, ASP.NET) store key/value pairs in hash data structures. PHP uses a hash table to store arrays, which makes them theoretically very fast, O(1), for storing and retrieving data. The problem arises when more than one key gets mapped to the same bucket, creating hash collisions. Inserting an element into such a bucket becomes more expensive, O(n), and therefore inserting n colliding keys jumps from O(n) to O(n^2).
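The quadratic blow-up can be demonstrated directly. Here is a minimal sketch (timeInsert is a hypothetical helper, and the sizes are kept small so it finishes quickly; exact timings depend on the PHP version):

<?php
// Compare inserting n keys that spread across buckets vs n keys that
// all land in the same bucket.
function timeInsert(int $n, int $step): float {
    $start = microtime(true);
    $a = [];
    for ($i = 0; $i < $n; $i++) {
        $a[$i * $step] = 0;   // $step = 32768 forces every key into one bucket
    }
    return microtime(true) - $start;
}

$spread  = timeInsert(10000, 1);      // sequential keys: no collisions, O(n)
$collide = timeInsert(10000, 32768);  // colliding keys: O(n^2) behaviour
printf("no collisions: %.4fs, all collisions: %.4fs\n", $spread, $collide);

On my understanding of the internals, the second call should be orders of magnitude slower, because every insert has to walk the ever-growing collision chain of bucket 0.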

And this is exactly what goes on here. When the step changes from 32767 to 32768, the keys go from having no collisions at all to every key colliding into the same bucket.

This is the case because of the way PHP arrays are implemented in C. The internal table size is always a power of 2 (an array of 9 or 15 elements gets a table of size 16). Also, if the array key is an integer, the hash is that integer with a mask applied on top of it; the mask is the table size minus 1, in binary. This means that if someone inserts the keys 0, 32, 64, 128, 256, and so on into an array backed by a table of size 32, they all map to the same bucket, and that bucket degenerates into a linked list. The above example creates exactly this situation.
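The masking step can be sketched in a few lines. This is a simplification, not the real zend_hash code: I fix the mask at 32767 (table size 32768, roughly what a 32768-element array gets) and count how many distinct buckets each step size touches:

<?php
// Sketch: bucket index for an integer key in a power-of-two hash table
// is (key & mask), where mask = table size - 1.
function bucketsUsed(int $step, int $mask): int {
    $buckets = [];
    for ($key = 0; $key <= 1073709056; $key += $step) {
        $buckets[$key & $mask] = true;   // bucket index = hash & mask
    }
    return count($buckets);
}

$mask = 32768 - 1;
echo bucketsUsed(32767, $mask), "\n"; // keys spread over 32768 distinct buckets
echo bucketsUsed(32768, $mask), "\n"; // every multiple of 32768 masks to 0 -> 1

Since 32768 is a power of two, every multiple of it has its low 15 bits zero, so (key & 32767) is always 0: one bucket for everything. With step 32767 the low bits cycle through all residues, so the keys spread evenly.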

Processing this requires a lot of CPU, and therefore you see the huge time increase. What this means is that developers should be really careful when they accept outside data that will be parsed into an array (people can craft such data easily and DoS the server). This data can come from $_GET and $_POST requests (which is why you can limit the number of variables with max_input_vars), XML, JSON, and so on.
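For input that max_input_vars does not cover, such as JSON bodies, you can apply the same idea yourself. A defensive sketch (safeJsonDecode is a hypothetical helper, not a standard function; the limit of 1000 mirrors the max_input_vars default):

<?php
// Cap the number of keys accepted from untrusted JSON, mirroring what
// max_input_vars does for $_GET/$_POST.
function safeJsonDecode(string $json, int $maxKeys = 1000): ?array {
    $data = json_decode($json, true);
    if (!is_array($data) || count($data, COUNT_RECURSIVE) > $maxKeys) {
        return null;  // reject oversized payloads before they hit hot code paths
    }
    return $data;
}

// The built-in limit for request variables can be checked with:
// ini_get('max_input_vars')

Rejecting the payload up front is cheap; the expensive part is only reached once the keys are actually inserted into an array, so the check has to happen before any per-key processing.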

Here are the resources I used to learn about these things:

  • res 1
  • res 2
  • res 3
  • res 4
  • res 5
Salvador Dali answered Oct 14 '22 06:10


I don't know anything about PHP specifically, but 32767 is the maximum value of a 2-byte signed number. Increasing it to 32768 would require a 3-byte number (which is never used, so it would be 4 bytes), which would in turn make everything slower.

tvervack answered Oct 14 '22 05:10