How to aggregate matching pairs into "connected components" in Python

Real-world problem:

I have data on directors across many firms, but sometimes "John Smith, director of XYZ" and "John Smith, director of ABC" are the same person, sometimes they're not. Also "John J. Smith, director of XYZ" and "John Smith, director of ABC" might be the same person, or might not be. Often examination of additional information (e.g., comparison of biographical data on "John Smith, director of XYZ" and "John Smith, director of ABC") makes it possible to resolve whether two observations are the same person or not.

Conceptual version of the problem:

In that spirit, am collecting data that will identify matching pairs. For example, suppose I have the following matching pairs: {(a, b), (b, c), (c, d), (d, e), (f, g)}. I want to use the transitivity property of the relation "is the same person as" to generate "connected components" of {{a, b, c, d, e}, {f, g}}. That is {a, b, c, d, e} is one person and {f, g} is another. (An earlier version of the question referred to "cliques", which are apparently something else; this would explain why find_cliques in networkx was giving the "wrong" results (for my purposes).

The following Python code does the job. But I wonder: is there a better (less computationally costly) approach (e.g., using standard or available libraries)?

There are examples here and there that seem related (e.g., Cliques in python), but these are incomplete, so I am not sure what libraries they are referring to or how to set up my data to use them.

Sample Python 2 code:

def get_cliques(pairs):
    from sets import Set

    set_list = [Set(pairs[0])]

    for pair in pairs[1:]:
        matched=False
        for set in set_list:
            if pair[0] in set or pair[1] in set:
                set.update(pair)
                matched=True
                break
        if not matched:
            set_list.append(Set(pair))

    return set_list

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

print(get_cliques(pairs))

This produces the desired output: [Set(['a', 'c', 'b', 'e', 'd']), Set(['g', 'f'])].

Sample Python 3 code:

This produces [set(['a', 'c', 'b', 'e', 'd']), set(['g', 'f'])]):

def get_cliques(pairs):

    set_list = [set(pairs[0])]

    for pair in pairs[1:]:
        matched=False
        for a_set in set_list:
            if pair[0] in a_set or pair[1] in a_set:
                a_set.update(pair)
                matched=True
                break
        if not matched:
            set_list.append(set(pair))

    return set_list

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

print(get_cliques(pairs))

248

asked Jan 15 '15 15:01

Ian Gow

2 Answers

With networkX:

import networkx as nx
G1=nx.Graph()
G1.add_edges_from([("a","b"),("b","c"),("c","d"),("d","e"),("f","g")])
sorted(nx.connected_components(G1), key = len, reverse=True)

giving:

[['a', 'd', 'e', 'b', 'c'], ['f', 'g']]

You have to check the fastest algorithm now ...

OP:

This works great! I have this in my PostgreSQL database now. Just organize pairs into a two-column table, then use array_agg() to pass to PL/Python function get_connected(). Thanks.

CREATE OR REPLACE FUNCTION get_connected(
    lhs text[],
    rhs text[])
  RETURNS SETOF text[] AS
$BODY$
    pairs = zip(lhs, rhs)

    import networkx as nx
    G=nx.Graph()
    G.add_edges_from(pairs)
    return sorted(nx.connected_components(G), key = len, reverse=True)

$BODY$ LANGUAGE plpythonu;

(Note: I edited answer, as I thought showing this step might be helpful addendum, but too long for a comment.)

142

answered Oct 05 '22 16:10

Gerard Rozsavolgyi

I don't believe (correct me if I'm wrong) that this is directly related to the largest clique problem. The definition of cliques (wikipedia) says that a clique "in an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge". In this case, we want to find which nodes can reach eachother (even indirectly).

I made a little sample. It builds a graph and traverses it looking for neighbors. This should be pretty efficient since each node is only traversed once when groups are formed.

from collections import defaultdict

def get_cliques(pairs):
    # Build a graph using the pairs
    nodes = defaultdict(lambda: [])
    for a, b in pairs:
        if b is not None:
            nodes[a].append((b, nodes[b]))
            nodes[b].append((a, nodes[a]))
        else:
            nodes[a]  # empty list

    # Add all neighbors to the same group    
    visited = set()
    def _build_group(key, group):
        if key in visited:
            return
        visited.add(key)
        group.add(key)
        for key, _ in nodes[key]:
            _build_group(key, group)

    groups = []
    for key in nodes.keys():
        if key in visited: continue
        groups.append(set())
        _build_group(key, groups[-1])

    return groups

if __name__ == '__main__':
    pairs = [
        ('a', 'b'), ('b', 'c'), ('b', 'd'), # a "tree"
        ('f', None),                        # no relations
        ('h', 'i'), ('i', 'j'), ('j', 'h')  # circular
    ]
    print get_cliques(pairs)
    # Output: [set(['a', 'c', 'b', 'd']), set(['f']), set(['i', 'h', 'j'])]

If your data set is best modeled like a graph and really big, maybe a graph database such as Neo4j is appropriate?

answered Oct 05 '22 17:10

André Laszlo

Related questions
                            
                                Multiple mouse pointers?
                            
                                How to set timeout detection on a RabbitMQ server?
                            
                                How to do Python's zip in C#?
                            
                                Is there a performance gain from defining routes in app.yaml versus one large mapping in a WSGIApplication in AppEngine?
                            
                                python logging alternatives [closed]
                            
                                Python/Scipy 2D Interpolation (Non-uniform Data)
                            
                                Django: why are Django model fields class attributes?
                            
                                What's your folder layout for a Flask app divided in modules?
                            
                                pickling error in python?
                            
                                mod_wsgi and multiple installations of python
                            
                                lxml not adding newlines when inserting a new element into existing xml
                            
                                RFCOMM without pairing using PyBluez on Debian?
                            
                                Multidimensional Scaling Fitting in Numpy, Pandas and Sklearn (ValueError)
                            
                                What part of speech does "s" stand for in WordNet synsets
                            
                                selenium.common.exceptions.WebDriverException: Message: 'Can not connect to GhostDriver'
                            
                                multiprocessing.Process.is_alive() returns True although process has finished, why?
                            
                                argparse argument dependency
                            
                                Multiprocessing of shared list
                            
                                How to Zoom with Axes3D in Matplotlib
                            
                                Why does python print version info to stderr?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to aggregate matching pairs into "connected components" in Python

Tags:

python

postgresql

connected-components

plpython

clique