Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

optimal algorithm grouping data in javascript

The following (simplified) json type of data defines a Contact:

{
  id:   number;
  name: string;
  phone: string;
  email: string
}

There is the following group of data:

+---+----------+-------------+---------------------------+ 
|id | name     | phone       |email                      | 
+---+----------+-------------+---------------------------+
|1  | John     | 11111111    |[email protected]              | 
|2  | Marc     | 22222222    |[email protected]              | 
|3  | Ron      | 99999999    |[email protected]              |
|4  | Andrew   | 55555555    |[email protected]              |
|5  | Wim      | 99999999    |[email protected]              |
|6  | Marc     | 33333333    |[email protected]              |
|7  | Dan      | 44444444    |[email protected]              |
+---+----------+-------------+---------------------------+

Goal is to find the groups that belong together using javascript(optionally in lodash, but main idea is to get the algorithm clear) according to the following constraint: a contact belongs to a group when any of the following criteria are the same: name, phone or email. The results shows the id's grouped as arrays in arrays. a contact in a group of 1 is ignored.

In the example above, it means that contacts with ids 1,3,5 belong together since 1,3 share the same email and 3 and 5 share the same phonenumber. Likewise 2,6,7: 2 and 6 have the same name and 6 and 7 have the same email. 5 does not have anything in common. The expected result therefore is:
[[1,3,5], [2,6,7]]

Background: One solution that works is iterating over every item and check for the remainder of the list if name, email, or phone is the same. If so, group them and take them out of the list (in the example we compare 1 with all the items in the list and only 3 is found). problem is that next items also need to be checked again to these groups, because in this case 5 is not yet detected as part of the group. This makes the algorithm complex, while I suspect there is a simple way of solving this in linear time. There might also be a name for this class of problems? `

like image 568
Han Avatar asked Nov 20 '18 09:11

Han


2 Answers

Idea:

  • Start with 0 groups
  • Iterate your list of contacts
  • Check if there is a group with the contact name, or phone, or email. Merge all the members of those groups as the same group. Then add yourself to that group. If not, begin a new group with yourself and set the name, phone and email group as yourself.

Union find is an efficient structure to handle merging of disjoint sets. Code taken from here. As it uses path compression and union by rank, you can consider that the whole code is linear in the amount of contacts.

var data = [
      {id:1,name:'John',phone:'11111111',email:'[email protected]'},
      {id:2,name:'Marc',phone:'99999999',email:'[email protected]'},
      {id:3,name:'Ron',phone:'99999999',email:'[email protected]'},
      {id:4,name:'Andrew',phone:'55555555',email:'[email protected]'},
      {id:5,name:'Wim',phone:'99999999',email:'[email protected]'},
      {id:6,name:'Marc',phone:'33333333',email:'[email protected]'},
      {id:7,name:'Dan',phone:'44444444',email:'[email protected]'}
];

// UNION-FIND structure, with path comression and union by rank

var UNIONFIND = (function () {
    
    function _find(n)
    {
        if(n.parent == n) return n;	
        n.parent = _find(n.parent);	
        return n.parent;
    }
    
    return {
        makeset:function(id){    
            var newnode = {
                parent: null,
                id: id,
                rank: 0
            };
            newnode.parent = newnode;            
            return newnode;
        },
    
        find: _find,
     
        combine: function(n1, n2) {                                    
            var n1 = _find(n1);
            var n2 = _find(n2);
            
            if (n1 == n2) return;
        
            if(n1.rank < n2.rank)
            {
                n2.parent = n2;
                return n2;
            }
            else if(n2.rank < n1.rank)
            {
                n2.parent = n1;
                return n1;
            }
            else
            {
                n2.parent = n1;
                n1.rank += 1;
                return n1;
            }
        }
    };
})();

var groupHash = {name: {}, phone: {}, email: {}}
var groupNodes = []

data.forEach(function(contact){
  var group = UNIONFIND.makeset(contact.id);
  var groups = new Set();
  ["name", "phone", "email"].forEach(function(attr){
    if (groupHash[attr].hasOwnProperty(contact[attr])) groups.add(groupHash[attr][contact[attr]])
  });
  
  groups = Array.from(groups);
  groups.push(group);
  groupNodes.push(group);
  
  for(var i = 1; i < groups.length; i++) {
    UNIONFIND.combine(groups[0], groups[i]);
  }  
  
  ["name", "phone", "email"].forEach(function(attr){
      groupHash[attr][contact[attr]] = groups[0];
  });
  
})

var contactsInGroup = {}


groupNodes.forEach(function(group){
    var groupId = UNIONFIND.find(group).id;
    
    if (contactsInGroup.hasOwnProperty(groupId) == false) {
      contactsInGroup[groupId] = [];
    }
    
    contactsInGroup[groupId].push(group.id);
})

var result = Object.values(contactsInGroup).filter(function(list){
 return list.length > 1
})

console.log(result)
like image 64
juvian Avatar answered Oct 15 '22 18:10

juvian


Any answer that iterates over each of n entries, and then over a growing list of m groups to match against is going to have worst-time performance of O(n*m) (found when there are no two entries that match on any term).

Any answer that iterates over each entry, and then over groups, and uses arrays to test for matching values among q options will further have to pay a penalty of O(q) per match. In the worst case, with say all e-mails the same and all phones different, this will mean O(n*m).

I believe this answer is O(n), because assuming that the number of fields to match with is a constant (in this case, 3: name, phone and email), all operations in the main loop, which is run once per entry, are O(1).

There is an extra complication to fix the fact that, late in the process, we may find a bridge between two (or even 3) groups, as entries can match on different fields with entries from different groups. This may happen several times. To avoid having to rebuild groups during the main loop, we leave merging to the very end, where we first build a map of what-group-ends-up-where, and then finally move all entry IDs to their final group. This can all be done in O(m), with m the number of groups; with an extra O(n) when actually copying entry IDs to the merged groups: overall, we are still in O(n) territory.

The last line builds arrays of ids from the merged groups, and filters out any that does not have over 1 element.

const data = [
    {id:1,name:'John',phone:'11111111',email:'[email protected]'},
    {id:2,name:'Marc',phone:'99999999',email:'[email protected]'},
    {id:3,name:'Ron',phone:'99999999',email:'[email protected]'},
    {id:4,name:'Andrew',phone:'55555555',email:'[email protected]'},
    {id:5,name:'Wim',phone:'99999999',email:'[email protected]'},
    {id:6,name:'Marc',phone:'33333333',email:'[email protected]'},
    {id:7,name:'Dan',phone:'44444444',email:'[email protected]'}
];

const groups = function(inputs) {

    let valuesToGroups = new Map(
        ['name', 'phone', 'email'].map(key => [key, new Map()]));
    let groups = new Map();
    let pendingMerges = [];
    for (const entry of inputs) {
        let group = undefined;
        let found = [];
        for (const [key, valueMap] of valuesToGroups) {
            // look up value in values-index for current key
            group = valueMap.get(entry[key]);
            if (group !== undefined) {
                found.push(group);
                // not breaking allows groups to be merged
            }
        }
        if (found.length === 0) {
            // not found: create new group
            group = groups.size;
            groups.set(group, [entry.id]);
        } else {
            // found: add entry to chosen group
            group = found[0];
            groups.get(group).push(entry.id);
            if (found.length > 1) {
                pendingMerges.push(found);
            }
        }
        // add entry's values to index, pointing to group
        for (const [key, valueMap] of valuesToGroups) {
            valueMap.set(entry[key], group); 
        }        
    }
    // do pending merges; initially, all groups are stand-alone
    let merges = new Map(Array.from(groups.keys()).map(k => [k, k]));
    for (const merge of pendingMerges) {
        // contents will go to the lowest-numbered group
        const sorted = merge.map(groupId => merges.get(groupId)).sort();
        sorted.forEach(groupId => merges.set(groupId, sorted[0]));
    }
    const cleanGroups = new Map();
    groups.forEach((value, key) => { 
        const k = merges.get(key); 
        if ( ! cleanGroups.has(k)) {
            cleanGroups.set(k, []);
        }
        value.forEach(id => cleanGroups.get(k).push(id))
    })
    // return only non-empty groups
    return [... cleanGroups].filter(g => g[1].length>1).map(g => [... g[1]]);
}(data);

console.log(""+JSON.stringify(groups))
// output is [[1,2,3,5,6,7]]
like image 37
tucuxi Avatar answered Oct 15 '22 19:10

tucuxi