Trying to group values?

Question

I have some data like this:

and am looking for an output like this (group-id and the members of that group):

1: 1 2 6
2: 3 4 7
3: 5 9

The first row exists because 1 is "connected" to 2 and 2 is connected to 6. The second row exists because 3 is connected to 4 and 3 is connected to 7.

This looked to me like a graph traversal, but the final order does not matter, so I was wondering whether someone can suggest a simpler solution that I can use on a large dataset (billions of entries).

From the comments:

The problem is to find the set of disjoint sub-graphs given a set of edges.
The edges are not directed; the line '1 2' means that 1 is connected to 2 and 2 is connected to 1.
The '1:' in the sample output could be 'A:' without changing the meaning of the answer.
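
For comparison, the graph-traversal reading of the problem can be sketched in a few lines of Python. This is only a baseline under assumptions: the edge list is inferred from the sample output (the original input is not reproduced here), and connected_components is a hypothetical helper name. It holds the whole adjacency list in memory, so it is unlikely to be the right tool for billions of entries as-is.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return a list of sorted member lists, one per connected component."""
    # Build an undirected adjacency list: each edge is stored in both directions.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        members = []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(members))
    return components


# Assumed edge list, inferred from the sample output in the question.
edges = [(1, 2), (3, 4), (5, 9), (2, 6), (3, 7)]
for group_id, members in enumerate(connected_components(edges), start=1):
    print(f"{group_id}: {' '.join(map(str, members))}")
```

The group ids here are just counters in discovery order, which fits the comment that '1:' could be 'A:' without changing the answer.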

EDIT 1:

Problem looks solved now. Thanks to everyone for their help. I need some more help picking the best solution that can be used on billions of such entries.

EDIT 2:

Test Input file:

Benchmarks:

I tried out everything and the version posted by TokenMacGuy is the fastest on the sample dataset that I tried. The dataset has about 1 million entries for which it took me about 6 seconds on a Dual Quad-Core 2.4GHz machine. I haven't gotten a chance to run it on the entire dataset yet but I will post the benchmark as soon as it is available.

SingleNegationElimination · Accepted Answer

I've managed O(n log n).

Here is a (somewhat intense) C++ implementation:

#include <boost/pending/disjoint_sets.hpp>
#include <boost/property_map/property_map.hpp>

#include <map>
#include <set>
#include <iostream>


typedef std::map<int, int> rank_t;
typedef std::map<int, int> parent_t;

typedef boost::associative_property_map< rank_t > rank_pmap_t;
typedef boost::associative_property_map< parent_t > parent_pmap_t;

typedef boost::disjoint_sets< rank_pmap_t, parent_pmap_t > group_sets_t;

typedef std::set<int> key_set;
typedef std::map<int, std::set<int> > output;

With some typedefs out of the way, here's the real meat. I'm using boost::disjoint_sets, which just happens to be an exceptionally good representation for the problem. This first function checks whether either of the given keys has been seen before and, if not, adds it to the collection as its own set. The important part is really the union_set(a, b) call, which links the two sets together. If either key already belongs to a larger set in the groups collection, those sets get linked too.

void add_data(int a, int b, group_sets_t & groups, key_set & keys)
{
  if (keys.count(a) < 1) groups.make_set(a);
  if (keys.count(b) < 1) groups.make_set(b);
  groups.union_set(a, b);
  keys.insert(a);
  keys.insert(b);
}

This next part isn't too exciting: it iterates through all of the keys we've seen, gets the representative key for each one, and adds the pair (representative, key) to a map. Once that's done, it prints out the map.

void build_output(group_sets_t & groups, key_set & keys)
{
  output out;
  for (key_set::iterator i(keys.begin()); i != keys.end(); i++)
    out[groups.find_set(*i)].insert(*i);

  for (output::iterator i(out.begin()); i != out.end(); i++)
  {
    std::cout << i->first << ": ";
    for (output::mapped_type::iterator j(i->second.begin()); j != i->second.end(); j++)
      std::cout << *j << " ";
    std::cout << std::endl;
  }
}

int main()
{

  rank_t rank;
  parent_t parent;
  rank_pmap_t rank_index(rank);
  parent_pmap_t parent_index(parent);


  group_sets_t groups( rank_index, parent_index );
  key_set keys;

  int a, b;
  // Read both endpoints together so a dangling final value is never used.
  while (std::cin >> a >> b)
  {
    add_data(a, b, groups, keys);
  }


  build_output(groups, keys);
  //std::cout << "number of sets: " <<
  //  groups.count_sets(keys.begin(), keys.end()) << std::endl;

}

I stayed up many hours learning how to use boost::disjoint_sets on this problem. There doesn't seem to be much of any documentation on it.

About the performance: the disjoint_sets structure is amortized O(α(n)) for its key operations (make_set, find_set and union_set), which is pretty close to constant. If it were just a matter of building the structure, the whole algorithm would be O(n α(n)) (which is effectively O(n)), but we also have to print it out. That means building up some associative containers, and the tree-based std::map and std::set used here cannot perform better than O(n log n). The rank and parent maps are std::map as well, so every disjoint_sets operation also pays an O(log n) lookup. It might be possible to get a speedup by choosing different associative containers (say, hash_set etc.), since once you have populated the initial list, you can reserve an optimal amount of space.
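
To illustrate the point about hash-based containers, here is a minimal Python sketch of the same union-by-rank, path-compression structure backed by dicts instead of std::map. It is not the Boost implementation: DisjointSets and group_edges are names made up for this example, and the edge list is assumed from the question's sample output.

```python
from collections import defaultdict


class DisjointSets:
    """Union-find with union by rank and path compression, stored in dicts."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, x):
        # Register x as its own singleton set if it has not been seen yet.
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0

    def find_set(self, x):
        # First pass: walk up to the root of x's tree.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every node on the path at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union_set(self, a, b):
        root_a, root_b = self.find_set(a), self.find_set(b)
        if root_a == root_b:
            return
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[root_a] < self.rank[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        if self.rank[root_a] == self.rank[root_b]:
            self.rank[root_a] += 1


def group_edges(edges):
    """Return the members of each group, in first-seen order."""
    sets = DisjointSets()
    for a, b in edges:
        sets.make_set(a)
        sets.make_set(b)
        sets.union_set(a, b)
    # Bucket every key under its representative; dict lookups are expected O(1).
    groups = defaultdict(list)
    for key in sets.parent:
        groups[sets.find_set(key)].append(key)
    return list(groups.values())


# Assumed edge list, inferred from the sample output in the question.
edges = [(1, 2), (3, 4), (5, 9), (2, 6), (3, 7)]
for group_id, members in enumerate(group_edges(edges), start=1):
    print(f"{group_id}: {' '.join(map(str, sorted(members)))}")
```

Nothing in the grouping step sorts, so the only ordering cost left is the optional sorted() call used for display.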

Legend · Answer

OK, so I got something working in parallel to the other solution posted by @Jonathan (first of all, many thanks for your time). My solution looks deceptively simple, but I would love some suggestions on whether it is correct (maybe I'm missing a corner case somewhere?). It seems to produce the output I wanted, but I'll have to parse it in a second pass to collect the members of each group number, which is trivial. The logic is that every time it finds a pair of numbers that are both new (not yet in the array), it increments a group_id counter:

My code in PHP:

<?php

//$fp = fopen("./resemblance.1.out", "r");
$fp = fopen("./wrong", "r");

$groups = array();
$group["-1"] = 1;
$groups[] = $group;

$map = array();

//Maintain a count
$group = 1;

while(!feof($fp)) {
        $source = trim(fgets($fp, 4096));
        //echo $source."\n";

        $source = explode(" ", $source);

        if(array_key_exists($source[0], $map) && !array_key_exists($source[1], $map)) {
                $map[$source[1]] = $map[$source[0]];
        } else if(array_key_exists($source[1], $map) && !array_key_exists($source[0], $map)) {
                $map[$source[0]] = $map[$source[1]];
        } else if(array_key_exists($source[1], $map) && array_key_exists($source[0], $map)) {
                // Both seen already: merge only when their groups differ.
                // (An edge inside one group must not fall through to the else branch.)
                if($map[$source[1]] != $map[$source[0]]) {
                        // Adjust the groups - change the groups of one of the elements to the other
                        $keys = array_keys($map, $map[$source[1]]);
                        print_r($keys);
                        foreach($keys as $key) {
                                $map[$key] = $map[$source[0]];
                        }
                }
        } else {
                $group++;
                $map[$source[0]] = $group;
                $map[$source[1]] = $group;
        }
}

print_r($map);
?>

Output:

Array
(
    [1] => 2
    [2] => 2
    [3] => 3
    [4] => 3
    [5] => 4
    [9] => 4
    [6] => 2
    [7] => 3
    [] => 5
)
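
The second pass mentioned earlier could look roughly like this in Python. It is a sketch, not part of the PHP script: invert_labels is a hypothetical helper, and the label_of dict simply copies the map printed in the output. The empty key appears to come from a trailing blank line in the input file, so the sketch skips it; that is an assumption about the input.

```python
from collections import defaultdict


def invert_labels(label_of):
    """Turn {member: group_label} into {group_label: [members, ...]}."""
    members_of = defaultdict(list)
    for member, label in label_of.items():
        if member == "":
            # Assumed artifact of a blank input line; not a real member.
            continue
        members_of[label].append(member)
    return dict(members_of)


# Copy of the printed map, including the stray empty key.
label_of = {1: 2, 2: 2, 3: 3, 4: 3, 5: 4, 9: 4, 6: 2, 7: 3, "": 5}

# Renumber the groups from 1, since the labels themselves carry no meaning.
for new_id, members in enumerate(invert_labels(label_of).values(), start=1):
    print(f"{new_id}: {' '.join(str(m) for m in sorted(members))}")
```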

EDIT: Fixed the bug that was mentioned in the comment. Just playing around out of curiosity :) Feel free to point out any other bugs. I am currently testing out which solution is faster.
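
On speed: the merge branch calls array_keys with a search value, which scans the whole map each time two groups meet, so many merges on a large input can get expensive. One possible variation, sketched here in Python with made-up names (label_groups, label_of, members_of), keeps a member list per group and always relabels the smaller group. With that rule each number is relabelled at most about log2(n) times. The edge list is again assumed from the sample output.

```python
def label_groups(edges):
    """Group nodes by label, relabelling the smaller group on every merge."""
    label_of = {}     # node -> group label
    members_of = {}   # group label -> list of nodes carrying that label
    next_label = 1

    for a, b in edges:
        label_a, label_b = label_of.get(a), label_of.get(b)
        if label_a is None and label_b is None:
            # Neither endpoint seen before: open a new group.
            label = next_label
            next_label += 1
            members_of[label] = []
            # dict.fromkeys drops the duplicate when a == b (a self-loop).
            for node in dict.fromkeys((a, b)):
                label_of[node] = label
                members_of[label].append(node)
        elif label_a is None:
            label_of[a] = label_b
            members_of[label_b].append(a)
        elif label_b is None:
            label_of[b] = label_a
            members_of[label_a].append(b)
        elif label_a != label_b:
            # Both seen, different groups: move the smaller group into the larger.
            if len(members_of[label_a]) < len(members_of[label_b]):
                label_a, label_b = label_b, label_a
            for node in members_of[label_b]:
                label_of[node] = label_a
            members_of[label_a].extend(members_of.pop(label_b))
        # Both seen with the same label: nothing to do.
    return members_of


# Assumed edge list, inferred from the sample output in the question.
edges = [(1, 2), (3, 4), (5, 9), (2, 6), (3, 7)]
for label, members in label_groups(edges).items():
    print(f"{label}: {' '.join(map(str, sorted(members)))}")
```

Because the member lists are maintained as the edges arrive, no separate second pass is needed to collect each group.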
