Optimizing array merge operation

I would appreciate any help.

I have 7 separate arrays with approx. 90,000 numbers in each (let's call them array1 through array7). There are no duplicate numbers within each array itself, but there can be duplicates between the arrays. For example, array2 has no internal duplicates, but it may have numbers in common with array3 and array4.

The Problem: I am trying to identify all of the numbers that appear exactly 3 times once all 7 arrays are merged.

I must do this calculation 1,000 times, which takes 15 minutes. That is not acceptable, because I have to run the whole process 40 times.

If you know of another language that is better suited to this type of calculation, please let me know. Suggestions for extensions such as Redis or Gearman are also welcome. The code:

for ($kj = 1; $kj <= 1000; $kj++) {
    // Merge the seven arrays; numeric keys, so values are appended
    $result = array_merge($files_array1, $files_array2, $files_array3,
                          $files_array4, $files_array5, $files_array6,
                          $files_array7);

    // Count how many times each number occurs in the merged array
    $result = array_count_values($result);

    $fp_lines = fopen("equalTo3.txt", "w");

    // Write out every number that occurs exactly 3 times
    foreach ($result as $key => $val) {
        if ($val == 3) {
            fwrite($fp_lines, $key . "\r\n");
        }
    }
    fclose($fp_lines);
}

I have also tried the code below using strings, but the array_map and array_count_values calls alone take 17 minutes:

for ($kj = 1; $kj <= 1000; $kj++) {

    $result = '';

    // Concatenate the contents of the seven arrays into one big string;
    // double quotes are needed so "\r\n" is a real line break rather than
    // the literal characters \r\n
    for ($ii = 0; $ii < 7; $ii++) {
        $result .= $files_array[$hello_won[$ii]] . "\r\n";
    }

    $result2 = explode("\n", $result);        // 5 mins
    $result2 = array_map("trim", $result2);   // 11 mins
    $result2 = array_count_values($result2);  // 4-6 mins

    $fp_lines = fopen("equalTo3.txt", "w");

    foreach ($result2 as $key => $val) {
        if ($val == 3) {
            fwrite($fp_lines, $key . "\r\n");
        }
    }
    fclose($fp_lines);

    unset($result2);
}
asked Apr 28 '14 by user3152377

3 Answers

array_merge() gets significantly slower as the arrays grow, because (from php.net):

If the input arrays have the same string keys, then the later value for that key will overwrite the previous one. If, however, the arrays contain numeric keys, the later value will not overwrite the original value, but will be appended.

Values in the input array with numeric keys will be renumbered with incrementing keys starting from zero in the result array.

So this function performs conditional checks on every element. You can replace array_merge() with plain appending: a loop (foreach or any other) plus the [] operator. For example, you can write a function imitating array_merge (using references so the arrays are not copied):

function imitateMerge(&$array1, &$array2) {
    // Plain appends with the [] operator: no key checks, no renumbering
    foreach ($array2 as $i) {
        $array1[] = $i;
    }
}

You should see a significant speed increase.
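A minimal usage sketch, assuming the seven $files_arrayN variables from the question ($merged and $counts are names of my choosing):

$merged = array();
imitateMerge($merged, $files_array1);
imitateMerge($merged, $files_array2);
imitateMerge($merged, $files_array3);
imitateMerge($merged, $files_array4);
imitateMerge($merged, $files_array5);
imitateMerge($merged, $files_array6);
imitateMerge($merged, $files_array7);

// Then count and filter exactly as before
$counts = array_count_values($merged);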

answered by sunshinejr


I suggest replacing

foreach ($result as $key => $val) {
    if ($val == 3) {
        fwrite($fp_lines, $key . "\r\n");
    }
}

with something like:

$res = array_keys(array_filter($result, function($val){return $val == 3;}));
fwrite($fp_lines, implode("\r\n", $res));
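For context, a minimal sketch of the full write step with this change, assuming $result already holds the output of array_count_values() ($keysWithCount3 is a made-up name):

// Keep only the numbers that occur exactly 3 times; array_filter and
// array_keys run the loop in C instead of interpreted PHP
$keysWithCount3 = array_keys(array_filter($result, function ($val) {
    return $val == 3;
}));

$fp_lines = fopen("equalTo3.txt", "w");
fwrite($fp_lines, implode("\r\n", $keysWithCount3)); // one fwrite instead of thousands
fclose($fp_lines);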
answered by Ziumin


This is probably all wrong; see the last edit.

I also think array_merge is the problem, but my suggestion would be to implement a function that counts the values in several arrays directly, instead of merging first. How much this helps depends on how much overlap there is between your arrays: if the overlap is very small, this might not be much faster than merging, but with significant overlap (I filled the arrays with rand(0, 200000) when I tried it) it is much faster.

function arrValues($arrs) {
    $values = array();

    foreach ($arrs as $arr) {
        foreach ($arr as $val) {
            // Count each value directly instead of merging first
            if (array_key_exists($val, $values)) {
                $values[$val]++;
            } else {
                $values[$val] = 1;
            }
        }
    }
    return $values;
}

var_dump(arrValues(array(
    $files_array1,
    $files_array2,
    $files_array3,
    $files_array4,
    $files_array5,
    $files_array6,
    $files_array7
)));

The computation takes about 0.5s on my machine, then another 2s for printing the stuff.

-edit-

It's also not clear to me why you do the same thing 1000 times. Are the arrays different each time? Saying a bit more about the reason might give people additional ideas...

-edit again-

After some more exploration I don't believe array_merge is at fault anymore. You don't have enough overlap to benefit that much from counting everything directly. Have you investigated the available memory on your machine? For me, merging 7 arrays with 90k elements each takes ~250MB. If you have allowed php to use that much memory, which I assume you have since you do not get any allocation errors, then maybe the problem is that the memory is simply not available and you get a lot of page faults? If that is not the problem, what kind of machine and what php version are you using? I've tested your original code on 5.5 and 5.4, and after fixing the memory issue it also runs in about 0.5s. That is per iteration, mind you. If you do this 1000 times in the same php script it will of course take a while, even more so considering that you allocate all this memory each time.
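For reference, a quick way to check the configured limit and the script's peak usage from inside php (the 512M value below is an assumption; size it to your data):

// Inspect the configured limit and actual peak usage
echo ini_get('memory_limit') . "\n";            // e.g. "128M"
echo memory_get_peak_usage(true) . " bytes\n";

// Raise the limit for this script only (assumed value, adjust as needed)
ini_set('memory_limit', '512M');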

I believe you really should consider putting the data in a database. Given your numbers, it seems you handle ~500M rows in total, which is an awful lot for php. A database makes it easy.
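A minimal sketch of that approach with PDO and SQLite (it assumes the pdo_sqlite extension is available; the file, table, and column names are made up for illustration). The counting collapses into a single GROUP BY query:

$db = new PDO('sqlite:numbers.db');
$db->exec('CREATE TABLE IF NOT EXISTS nums (n INTEGER)');

// Bulk-insert all seven arrays inside one transaction for speed
$db->beginTransaction();
$stmt = $db->prepare('INSERT INTO nums (n) VALUES (?)');
foreach (array($files_array1, $files_array2, $files_array3, $files_array4,
               $files_array5, $files_array6, $files_array7) as $arr) {
    foreach ($arr as $n) {
        $stmt->execute(array($n));
    }
}
$db->commit();

// Numbers that occur exactly 3 times across all seven arrays
$rows = $db->query('SELECT n FROM nums GROUP BY n HAVING COUNT(*) = 3')
           ->fetchAll(PDO::FETCH_COLUMN);
file_put_contents('equalTo3.txt', implode("\r\n", $rows));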

answered by monocell