I have 2 very large arrays (of size ~2,500,000). I need to find difference between these arrays. By difference I mean I need a resultant array with values that are in array 1 but not in array 2. I have used array_diff() but it takes more than half an hour!
The first array is coming from one DB and second array from another db. They aren't on the same Database server. The arrays aren't of same size. I am dealing with huge number of mobile numbers. I need to find out those mobile numbers which are in one list, but aren't in another list
arrays are normal arrays with numeric keys. Diff code is as follows:
$numbers_list = array_diff($numbers_list, $some_other_list);
Is there a better way of doing this? Please help.
This is the simple algorithm.
As you are working with very large arrays, it'll consume a lot of memory.
Here is my implementation,
$a = file("l.a"); // l.a is a file contains 2,500,000 lines
$b = file("l.b");
function large_array_diff($b, $a){
// Flipping
$at = array_flip($a);
$bt = array_flip($b);
// checking
$d = array_diff_key($bt, $at);
return array_keys($d);
}
I ran it using 4G
memory limit. 3G also works. Just tested.
$ time php -d memory_limit=4G diff_la.php
It took about 11 seconds!.
real 0m10.612s
user 0m8.940s
sys 0m1.460s
UPDATE
Following code runs 2x faster than large_array_diff
function stated above.
function flip_isset_diff($b, $a) {
$at = array_flip($a);
$d = array();
foreach ($b as $i)
if (!isset($at[$i]))
$d[] = $i;
return $d;
}
Because it does not call array_flip
(1 time), array_diff_key
and array_keys
. Lots of CPU cycles are saved due to this.
OK, updated, casting to string for good measure (it makes a LOT of difference if we could use integers instead of strings...):
<?php
$start = microtime(true);
echo 'creating' . PHP_EOL;
$array1 = array();
$array2 = array();
for ($i = 0;$i < 2000000;$i++) {
$array1[] = (string)rand(100000000, 999999999);
$array2[] = (string)rand(100000000, 999999999);
}
echo (microtime(true) - $start) . PHP_EOL;
$start = microtime(true);
echo 'key diff flip' . PHP_EOL;
$array1f = array_flip($array1);
$array2f = array_flip($array2);
echo (microtime(true) - $start) . PHP_EOL;
$start = microtime(true);
echo 'key diff' . PHP_EOL;
$diff = array_diff_key($array1f, $array2f);
echo (microtime(true) - $start) . PHP_EOL;
$start = microtime(true);
echo 'sorting' . PHP_EOL;
$start = microtime(true);
sort($array1);
sort($array2);
$notin2 = array();
reset($array2);
echo (microtime(true) - $start) . PHP_EOL;
echo 'comparing' . PHP_EOL;
$start = microtime(true);
foreach ($array1 as $value) {
while (current($array2) < $value) {
if (next($array2) == false) break;
}
if (current($array2) != $value) $notin2[] = $value;
}
echo (microtime(true) - $start) . PHP_EOL;
echo 'plain diff' . PHP_EOL;
$start = microtime(true);
$diff = array_diff($array1, $array2);
echo (microtime(true) - $start) . PHP_EOL;
?>
Strings
creating
8.66658186913
key diff flip
3.07359004021
key diff
1.48775887489
sorting
48.4047858715
comparing
9.41756916046
plain diff
19.21976614
real 1m37.574s
user 1m34.970s
sys 0m1.676s
Integers
(removed (string)
):
creating
4.69975805283
key diff flip
2.5843539238
key diff
1.4868490696
sorting
15.2628200054
comparing
5.62516498566
plain diff
101.688895941
real 2m18.090s
user 2m15.112s
sys 0m1.356s
Much to my surprise..... array_diff
hates integers it seems...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With