Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculating SimRank using NetworkX?

I was wondering how can we can use the python module networkX to implement SimRank to compare the similarity of 2 nodes? I understand that networkX provides methods for looking at neighbors, and link analysis algorithms such as PageRank and HITS, but is there one for SimRank?

Examples, tutorials are welcomed too!

like image 804
DjangoRocks Avatar asked Mar 19 '12 09:03

DjangoRocks


People also ask

How does NetworkX calculate average degree?

The average degree of an undirected graph is the sum of the degrees of all its nodes divided by the number of nodes in the graph.

How do I find the degree of a node in NetworkX?

The degree is the sum of the edge weights adjacent to the node.

What is SimRank explain in detail?

SimRank is a general similarity measure, based on a simple and intuitive graph-theoretic model. SimRank is applicable in any domain with object-to-object relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects.


2 Answers

Update I implemented an networkx_addon library. SimRank is included in the library. Check out: https://github.com/hhchen1105/networkx_addon for details.

Sample Usage:

    >>> import networkx
    >>> import networkx_addon
    >>> G = networkx.Graph()
    >>> G.add_edges_from([('a','b'), ('b','c'), ('a','c'), ('c','d')])
    >>> s = networkx_addon.similarity.simrank(G)

You may obtain the similarity score between two nodes (say, node 'a' and node 'b') by

    >>> print s['a']['b']

SimRank is a vertex similarity measure. It computes the similarity between two nodes on a graph based on the topology, i.e., the nodes and the links of the graph. To illustrate SimRank, let's consider the following graph, in which a, b, c connect to each other, and d is connected to d. How a node a is similar to a node d, is based on how a's neighbor nodes, b and c, similar to d's neighbors, c.

    +-------+
    |       |
    a---b---c---d

As seen, this is a recursive definition. Thus, SimRank is recursively computed until the similarity values converges. Note that SimRank introduces a constant r to represents the relative importance between in-direct neighbors and direct neighbors. The formal equation of SimRank can be found here.

The following function takes a networkx graph $G$ and the relative imporance parameter r as input, and returns the simrank similarity value sim between any two nodes in G. The return value sim is a dictionary of dictionary of float. To access the similarity between node a and node b in graph G, one can simply access sim[a][b].

    def simrank(G, r=0.9, max_iter=100):
      # init. vars
      sim_old = defaultdict(list)
      sim = defaultdict(list)
      for n in G.nodes():
        sim[n] = defaultdict(int)
        sim[n][n] = 1
        sim_old[n] = defaultdict(int)
        sim_old[n][n] = 0

      # recursively calculate simrank
      for iter_ctr in range(max_iter):
        if _is_converge(sim, sim_old):
          break
        sim_old = copy.deepcopy(sim)
        for u in G.nodes():
          for v in G.nodes():
            if u == v:
              continue
            s_uv = 0.0
            for n_u in G.neighbors(u):
              for n_v in G.neighbors(v):
                s_uv += sim_old[n_u][n_v]
            sim[u][v] = (r * s_uv / (len(G.neighbors(u)) * len(G.neighbors(v))))
      return sim

    def _is_converge(s1, s2, eps=1e-4):
      for i in s1.keys():
        for j in s1[i].keys():
          if abs(s1[i][j] - s2[i][j]) >= eps:
            return False
      return True

To calculate the similarity values between nodes in the above graph, you can try this.

    >> G = networkx.Graph()
    >> G.add_edges_from([('a','b'), ('b', 'c'), ('c','a'), ('c','d')])
    >> simrank(G)

You'll get

    defaultdict(<type 'list'>, {'a': defaultdict(<type 'int'>, {'a': 0, 'c': 0.62607626807407868, 'b': 0.65379221101693585, 'd': 0.7317028881451203}), 'c': defaultdict(<type 'int'>, {'a': 0.62607626807407868, 'c': 0, 'b': 0.62607626807407868, 'd': 0.53653543888775579}), 'b': defaultdict(<type 'int'>, {'a': 0.65379221101693585, 'c': 0.62607626807407868, 'b': 0, 'd': 0.73170288814512019}), 'd': defaultdict(<type 'int'>, {'a': 0.73170288814512019, 'c': 0.53653543888775579, 'b': 0.73170288814512019, 'd': 0})})

Let's verify the result by calculating similarity between, say, node a and node b, denoted by S(a,b).

S(a,b) = r * (S(b,a)+S(b,c)+S(c,a)+S(c,c))/(2*2) = 0.9 * (0.6538+0.6261+0.6261+1)/4 = 0.6538,

which is the same as our calculated S(a,b) above.

For more details, you may want to checkout the following paper:

G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD'02 pages 538-543. ACM Press, 2002.

like image 54
user1036719 Avatar answered Sep 29 '22 01:09

user1036719


No, simrank is not implemented in networkx.

If you were to add this to networkx, you could shorten the code given by user1036719 by using numpy and itertools:

def simrank(G, r=0.8, max_iter=100, eps=1e-4):

    nodes = G.nodes()
    nodes_i = {k: v for(k, v) in [(nodes[i], i) for i in range(0, len(nodes))]}

    sim_prev = numpy.zeros(len(nodes))
    sim = numpy.identity(len(nodes))

    for i in range(max_iter):
        if numpy.allclose(sim, sim_prev, atol=eps):
            break
        sim_prev = numpy.copy(sim)
        for u, v in itertools.product(nodes, nodes):
            if u is v:
                continue
            u_ns, v_ns = G.predecessors(u), G.predecessors(v)

            # evaluating the similarity of current iteration nodes pair
            if len(u_ns) == 0 or len(v_ns) == 0: 
                # if a node has no predecessors then setting similarity to zero
                sim[nodes_i[u]][nodes_i[v]] = 0
            else:                    
                s_uv = sum([sim_prev[nodes_i[u_n]][nodes_i[v_n]] for u_n, v_n in itertools.product(u_ns, v_ns)])
                sim[nodes_i[u]][nodes_i[v]] = (r * s_uv) / (len(u_ns) * len(v_ns))


    return sim

Then, taking the toy example from the SimRank paper (University graph), reproduces the paper results:

G = networkx.DiGraph()
G.add_edges_from([('1','2'), ('1', '4'), ('2','3'), ('3','1'), ('4', '5'), ('5', '4')])
pprint(simrank(G).round(3))

Which outputs:

array([[ 1.   ,  0.   ,  0.   ,  0.034,  0.132],
       [ 0.   ,  1.   ,  0.   ,  0.331,  0.042],
       [ 0.   ,  0.   ,  1.   ,  0.106,  0.414],
       [ 0.034,  0.331,  0.106,  1.   ,  0.088],
       [ 0.132,  0.042,  0.414,  0.088,  1.   ]])
like image 27
Jon Avatar answered Sep 29 '22 00:09

Jon