Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

difference of two arrays with 10 million items - _.difference is too slow

I have two arrays with user id's and I want to check different items in them.

arr1 = [123, 456, 789];
arr2 = [123, 456, 789, 098];

The problem is: these arrays can have 10 or 20 millions of items.

I'm trying with underscore.difference() but it took 10 mins to complete.

Is there any faster way to do this?

like image 472
user3175226 Avatar asked May 01 '14 18:05

user3175226


3 Answers

How about converting the arrays to object to reduce the sorting complexity:

var arr1 = [123, 456, 789], arr2 = [123, 456, 789, 098];

function toObject(arr){
    return arr.reduce(function(o, v, i) {
      o[v] = i;
      return o;
    }, {});
}

var o1 = toObject(arr1), o2 = toObject(arr2), diff = [];

for(var prop in o2){
    if(o1[prop] === undefined)
        diff.push(prop);
}

console.log(diff);

You would obviously need to start on the biggest set.

http://jsfiddle.net/sHUw5/

Another thing to consider is to sort your collections and do a binary search to reduce the complexity from (O)N to (O)log2N for each array (if I'm thinking straight).

like image 173
Johan Avatar answered Sep 30 '22 15:09

Johan


This considers you do not have the numbers 0 or 1 in the arrays:

var arr1 = [123, 456, 789,3],
    arr2 = [123, 456, 789, 098],    
    has = {},
    different=[],
    length1=arr1.length,
    length2=arr2.length;



for(var i=0;i<length1;i++){
        has[arr1[i]]=true;
}
for(var i=0;i<length2;i++){
    var val=arr2[i];
    if(has[val] === undefined){
       has[val]=val;
    }
    else{
        if(has[val]!=val){
            has[val]=false;
        }
   }
}
for(var i in has){
    if (has[i]) different.push(i);
}

If you want to check also for 0 and 1:

for(var i=0;i<length1;i++){
    has[arr1[i]]=NaN;
}
for(var i=0;i<length2;i++){
    var val=arr2[i];
    if(has[val] === undefined){
        has[val]=null;
    }
    else{
        if(has[val]!=null){
            has[val]=true;
        }
    }
}
for(var i in has){
    if (!has[i]) different.push(i);
}
like image 27
juvian Avatar answered Sep 30 '22 16:09

juvian


Use native js, rather than a library that tries to accommodate lots of scenarios / inputs.

  • Skip all the function calls and scope changes.
  • Minimize property lookup/namespace walking
  • Don't keep getting the array length on each loop

Simple optimization:

var array1 = [];
var array2 = [];

var difference = [];

for(var i = 0; len = array1.length; i < len; i++)
{
    var value = array1[i];

    if(value == array2[i])
    {
        continue;
    }

    if(array2.indexOf(value) == -1)
    {
        difference.push(value);
    }
}
like image 43
Metalstorm Avatar answered Sep 30 '22 15:09

Metalstorm