I was wondering how we can use the Python module NetworkX to implement SimRank to compare the similarity of two nodes. I understand that NetworkX provides methods for looking at neighbors, and link-analysis algorithms such as PageRank and HITS, but is there one for SimRank?
Examples and tutorials are welcome too!
SimRank is a general similarity measure based on a simple and intuitive graph-theoretic model. It is applicable in any domain with object-to-object relationships, and it measures the similarity of the structural context in which objects occur, based on their relationships with other objects.
Update: I implemented a networkx_addon library. SimRank is included in the library. Check out https://github.com/hhchen1105/networkx_addon for details.
Sample Usage:
>>> import networkx
>>> import networkx_addon
>>> G = networkx.Graph()
>>> G.add_edges_from([('a','b'), ('b','c'), ('a','c'), ('c','d')])
>>> s = networkx_addon.similarity.simrank(G)
You may obtain the similarity score between two nodes (say, node 'a' and node 'b') by
>>> print(s['a']['b'])
SimRank is a vertex similarity measure. It computes the similarity between two nodes on a graph based on the topology, i.e., the nodes and the links of the graph. To illustrate SimRank, let's consider the following graph, in which a, b, and c connect to each other, and d is connected to c. How similar a node a is to a node d is based on how similar a's neighbor nodes, b and c, are to d's neighbor, c.
+-------+
| |
a---b---c---d
As seen, this is a recursive definition. Thus, SimRank is computed iteratively until the similarity values converge. Note that SimRank introduces a constant r to represent the relative importance between indirect neighbors and direct neighbors. The formal equation of SimRank can be found here.
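For reference, the recursive definition from Jeh and Widom's paper can be written as follows (r is the decay constant; N(v) denotes the neighbor set of v in an undirected graph, whereas the paper uses in-neighbors I(v) for directed graphs):

```latex
s(u,v) =
\begin{cases}
  1, & u = v \\[4pt]
  \dfrac{r}{|N(u)|\,|N(v)|} \displaystyle\sum_{x \in N(u)} \sum_{y \in N(v)} s(x,y), & u \neq v
\end{cases}
```

The base case says every node is maximally similar to itself; the recursive case averages the similarities of all neighbor pairs and damps them by r.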
The following function takes a networkx graph G and the relative importance parameter r as input, and returns the SimRank similarity value sim between any two nodes in G. The return value sim is a dictionary of dictionaries of floats. To access the similarity between node a and node b in graph G, one can simply access sim[a][b].
import copy
from collections import defaultdict

def simrank(G, r=0.9, max_iter=100):
    # init. vars; each node starts maximally similar to itself
    sim_old = defaultdict(list)
    sim = defaultdict(list)
    for n in G.nodes():
        sim[n] = defaultdict(int)
        sim[n][n] = 1
        sim_old[n] = defaultdict(int)
        sim_old[n][n] = 0

    # iteratively calculate simrank until convergence (or max_iter)
    for iter_ctr in range(max_iter):
        if _is_converge(sim, sim_old):
            break
        sim_old = copy.deepcopy(sim)
        for u in G.nodes():
            for v in G.nodes():
                if u == v:
                    continue
                s_uv = 0.0
                # list() keeps this working on NetworkX 2.x, where
                # neighbors() returns an iterator rather than a list
                u_ns, v_ns = list(G.neighbors(u)), list(G.neighbors(v))
                for n_u in u_ns:
                    for n_v in v_ns:
                        s_uv += sim_old[n_u][n_v]
                sim[u][v] = (r * s_uv) / (len(u_ns) * len(v_ns))
    return sim

def _is_converge(s1, s2, eps=1e-4):
    for i in s1.keys():
        for j in s1[i].keys():
            if abs(s1[i][j] - s2[i][j]) >= eps:
                return False
    return True
To calculate the similarity values between nodes in the above graph, you can try this.
>>> G = networkx.Graph()
>>> G.add_edges_from([('a','b'), ('b','c'), ('c','a'), ('c','d')])
>>> simrank(G)
You'll get
defaultdict(<type 'list'>,
    {'a': defaultdict(<type 'int'>, {'a': 1, 'c': 0.62607626807407868, 'b': 0.65379221101693585, 'd': 0.7317028881451203}),
     'c': defaultdict(<type 'int'>, {'a': 0.62607626807407868, 'c': 1, 'b': 0.62607626807407868, 'd': 0.53653543888775579}),
     'b': defaultdict(<type 'int'>, {'a': 0.65379221101693585, 'c': 0.62607626807407868, 'b': 1, 'd': 0.73170288814512019}),
     'd': defaultdict(<type 'int'>, {'a': 0.73170288814512019, 'c': 0.53653543888775579, 'b': 0.73170288814512019, 'd': 1})})
Let's verify the result by calculating similarity between, say, node a and node b, denoted by S(a,b).
S(a,b) = r * (S(b,a)+S(b,c)+S(c,a)+S(c,c))/(2*2) = 0.9 * (0.6538+0.6261+0.6261+1)/4 = 0.6538,
which is the same as our calculated S(a,b) above.
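This hand check can be reproduced in a couple of lines of Python (the four similarity values are copied from the output above, rounded to four places):

```python
# Similarity values copied from the computed result above
s_ba, s_bc, s_ca, s_cc = 0.6538, 0.6261, 0.6261, 1.0
r = 0.9

# S(a,b) = r * (sum over neighbor pairs) / (|N(a)| * |N(b)|);
# a and b each have two neighbors in the example graph
s_ab = r * (s_ba + s_bc + s_ca + s_cc) / (2 * 2)
assert abs(s_ab - 0.6538) < 1e-4
```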
For more details, you may want to checkout the following paper:
G. Jeh and J. Widom. SimRank: A Measure of Structural-Context Similarity. In KDD '02, pages 538-543. ACM Press, 2002.
No, simrank is not implemented in networkx.
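(Since this answer was written, NetworkX has gained a built-in implementation: NetworkX 2.4 and later ship networkx.simrank_similarity. A minimal usage sketch, assuming a recent NetworkX:)

```python
import networkx as nx

# Same example graph as in the other answer
G = nx.Graph()
G.add_edges_from([('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')])

# importance_factor plays the role of the decay constant r used
# elsewhere in this thread; the result is a dict of dicts
sim = nx.simrank_similarity(G, importance_factor=0.9)
print(sim['a']['b'])
```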
If you were to add this to networkx, you could shorten the code given by user1036719 by using numpy and itertools:
import itertools
import numpy

def simrank(G, r=0.8, max_iter=100, eps=1e-4):
    nodes = list(G.nodes())
    nodes_i = {node: i for i, node in enumerate(nodes)}
    sim_prev = numpy.zeros((len(nodes), len(nodes)))
    sim = numpy.identity(len(nodes))
    for i in range(max_iter):
        if numpy.allclose(sim, sim_prev, atol=eps):
            break
        sim_prev = numpy.copy(sim)
        for u, v in itertools.product(nodes, nodes):
            if u == v:
                continue
            # list() keeps this working on NetworkX 2.x, where
            # predecessors() returns an iterator rather than a list
            u_ns, v_ns = list(G.predecessors(u)), list(G.predecessors(v))
            # evaluating the similarity of the current pair of nodes
            if len(u_ns) == 0 or len(v_ns) == 0:
                # if a node has no predecessors, set the similarity to zero
                sim[nodes_i[u]][nodes_i[v]] = 0
            else:
                s_uv = sum(sim_prev[nodes_i[u_n]][nodes_i[v_n]]
                           for u_n, v_n in itertools.product(u_ns, v_ns))
                sim[nodes_i[u]][nodes_i[v]] = (r * s_uv) / (len(u_ns) * len(v_ns))
    return sim
Then, taking the toy example from the SimRank paper (the university graph) reproduces the paper's results:
from pprint import pprint

G = networkx.DiGraph()
G.add_edges_from([('1','2'), ('1', '4'), ('2','3'), ('3','1'), ('4', '5'), ('5', '4')])
pprint(simrank(G).round(3))
Which outputs:
array([[ 1. , 0. , 0. , 0.034, 0.132],
[ 0. , 1. , 0. , 0.331, 0.042],
[ 0. , 0. , 1. , 0.106, 0.414],
[ 0.034, 0.331, 0.106, 1. , 0.088],
[ 0.132, 0.042, 0.414, 0.088, 1. ]])