Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to find differences between two large arrays in PHP

I have 2 very large arrays (of size ~2,500,000). I need to find difference between these arrays. By difference I mean I need a resultant array with values that are in array 1 but not in array 2. I have used array_diff() but it takes more than half an hour!

The first array is coming from one DB and second array from another db. They aren't on the same Database server. The arrays aren't of same size. I am dealing with huge number of mobile numbers. I need to find out those mobile numbers which are in one list, but aren't in another list

arrays are normal arrays with numeric keys. Diff code is as follows:

$numbers_list = array_diff($numbers_list, $some_other_list);

Is there a better way of doing this? Please help.

like image 778
Amar Avatar asked Jan 11 '12 21:01

Amar


2 Answers

This is the simple algorithm.

  1. Flip 1st array. Values will become keys. So repeated values will be discarded.
  2. Flip 2nd array (optional)
  3. Check for each element in 2nd array if it exists in 1st array.

As you are working with very large arrays, it'll consume a lot of memory.

Here is my implementation,

$a = file("l.a"); // l.a is a file contains 2,500,000 lines
$b = file("l.b");

function large_array_diff($b, $a){
    // Flipping 
    $at = array_flip($a);
    $bt = array_flip($b); 
    // checking
    $d = array_diff_key($bt, $at);

    return array_keys($d);   
}

I ran it using 4G memory limit. 3G also works. Just tested.

$ time php -d memory_limit=4G diff_la.php

It took about 11 seconds!.

real    0m10.612s
user    0m8.940s
sys     0m1.460s

UPDATE

Following code runs 2x faster than large_array_diff function stated above.

function flip_isset_diff($b, $a) {
    $at = array_flip($a);
    $d = array();
    foreach ($b as $i)
        if (!isset($at[$i])) 
            $d[] = $i;

    return $d;
}

Because it does not call array_flip (1 time), array_diff_key and array_keys. Lots of CPU cycles are saved due to this.

like image 71
Shiplu Mokaddim Avatar answered Oct 01 '22 17:10

Shiplu Mokaddim


OK, updated, casting to string for good measure (it makes a LOT of difference if we could use integers instead of strings...):

<?php
$start = microtime(true);
echo 'creating' . PHP_EOL;
$array1 = array();
$array2 = array();
for ($i = 0;$i < 2000000;$i++) {
    $array1[] = (string)rand(100000000, 999999999);
    $array2[] = (string)rand(100000000, 999999999);
}
echo (microtime(true) - $start) . PHP_EOL;
$start = microtime(true);
echo 'key diff flip' . PHP_EOL;
$array1f = array_flip($array1);
$array2f = array_flip($array2);
echo (microtime(true) - $start) . PHP_EOL;
$start = microtime(true);
echo 'key diff' . PHP_EOL;
$diff = array_diff_key($array1f, $array2f);
echo (microtime(true) - $start) . PHP_EOL;
$start = microtime(true);
echo 'sorting' . PHP_EOL;
$start = microtime(true);
sort($array1);
sort($array2);
$notin2 = array();
reset($array2);
echo (microtime(true) - $start) . PHP_EOL;
echo 'comparing' . PHP_EOL;
$start = microtime(true);
foreach ($array1 as $value) {
    while (current($array2) < $value) {
        if (next($array2) == false) break;
    }
    if (current($array2) != $value) $notin2[] = $value;
}
echo (microtime(true) - $start) . PHP_EOL;
echo 'plain diff' . PHP_EOL;
$start = microtime(true);
$diff = array_diff($array1, $array2);
echo (microtime(true) - $start) . PHP_EOL;
?>

Strings

creating
8.66658186913
key diff flip
3.07359004021
key diff
1.48775887489
sorting
48.4047858715
comparing
9.41756916046
plain diff
19.21976614

real    1m37.574s
user    1m34.970s
sys     0m1.676s

Integers (removed (string)):

creating
4.69975805283
key diff flip
2.5843539238
key diff
1.4868490696
sorting
15.2628200054
comparing
5.62516498566
plain diff
101.688895941

real    2m18.090s
user    2m15.112s
sys     0m1.356s

Much to my surprise..... array_diff hates integers it seems...

like image 24
Wrikken Avatar answered Oct 01 '22 15:10

Wrikken