I was wondering how we can use the Python module NetworkX to implement SimRank to compare the similarity of two nodes. I understand that NetworkX provides methods for looking at neighbors, and link-analysis algorithms such as PageRank and HITS, but is there one for SimRank?
Examples and tutorials are welcome too!
SimRank is a general similarity measure based on a simple and intuitive graph-theoretic model. It is applicable in any domain with object-to-object relationships, and it measures the similarity of the structural context in which objects occur, based on their relationships with other objects.
Update: I implemented a networkx_addon library. SimRank is included in the library. Check out https://github.com/hhchen1105/networkx_addon for details.
Sample Usage:
>>> import networkx
>>> import networkx_addon
>>> G = networkx.Graph()
>>> G.add_edges_from([('a','b'), ('b','c'), ('a','c'), ('c','d')])
>>> s = networkx_addon.similarity.simrank(G)
You may obtain the similarity score between two nodes (say, node 'a' and node 'b') by
>>> print(s['a']['b'])
SimRank is a vertex similarity measure. It computes the similarity between two nodes on a graph based on the topology, i.e., the nodes and the links of the graph. To illustrate SimRank, let's consider the following graph, in which a, b, and c connect to each other, and d is connected to c. How similar a node a is to a node d is based on how similar a's neighbor nodes, b and c, are to d's neighbor, c.
+-------+
| |
a---b---c---d
As seen, this is a recursive definition. Thus, SimRank is computed iteratively until the similarity values converge. Note that SimRank introduces a constant r to represent the relative importance between indirect neighbors and direct neighbors. The formal equation of SimRank can be found here.
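For reference, the recursive definition from Jeh and Widom's paper can be written as follows (r is the decay constant; N(v) denotes the neighbor set of v in an undirected graph, whereas the paper uses in-neighbors I(v) for directed graphs):

```latex
s(u,v) =
\begin{cases}
  1, & u = v \\[4pt]
  \dfrac{r}{|N(u)|\,|N(v)|} \displaystyle\sum_{x \in N(u)} \sum_{y \in N(v)} s(x,y), & u \neq v
\end{cases}
```

The base case says every node is maximally similar to itself; the recursive case averages the similarities of all neighbor pairs and damps them by r.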
The following function takes a networkx graph G and the relative importance parameter r as input, and returns the SimRank similarity value sim between any two nodes in G. The return value sim is a dictionary of dictionaries of floats. To access the similarity between node a and node b in graph G, one can simply access sim[a][b].
import copy
from collections import defaultdict

def simrank(G, r=0.9, max_iter=100):
    # init. vars; each node starts maximally similar to itself
    sim_old = defaultdict(list)
    sim = defaultdict(list)
    for n in G.nodes():
        sim[n] = defaultdict(int)
        sim[n][n] = 1
        sim_old[n] = defaultdict(int)
        sim_old[n][n] = 0

    # iteratively calculate simrank until convergence (or max_iter)
    for iter_ctr in range(max_iter):
        if _is_converge(sim, sim_old):
            break
        sim_old = copy.deepcopy(sim)
        for u in G.nodes():
            for v in G.nodes():
                if u == v:
                    continue
                s_uv = 0.0
                # list() keeps this working on NetworkX 2.x, where
                # neighbors() returns an iterator rather than a list
                u_ns, v_ns = list(G.neighbors(u)), list(G.neighbors(v))
                for n_u in u_ns:
                    for n_v in v_ns:
                        s_uv += sim_old[n_u][n_v]
                sim[u][v] = (r * s_uv) / (len(u_ns) * len(v_ns))
    return sim

def _is_converge(s1, s2, eps=1e-4):
    for i in s1.keys():
        for j in s1[i].keys():
            if abs(s1[i][j] - s2[i][j]) >= eps:
                return False
    return True
To calculate the similarity values between nodes in the above graph, you can try this.
>>> G = networkx.Graph()
>>> G.add_edges_from([('a','b'), ('b','c'), ('c','a'), ('c','d')])
>>> simrank(G)
You'll get
defaultdict(<type 'list'>,
    {'a': defaultdict(<type 'int'>, {'a': 1, 'c': 0.62607626807407868, 'b': 0.65379221101693585, 'd': 0.7317028881451203}),
     'c': defaultdict(<type 'int'>, {'a': 0.62607626807407868, 'c': 1, 'b': 0.62607626807407868, 'd': 0.53653543888775579}),
     'b': defaultdict(<type 'int'>, {'a': 0.65379221101693585, 'c': 0.62607626807407868, 'b': 1, 'd': 0.73170288814512019}),
     'd': defaultdict(<type 'int'>, {'a': 0.73170288814512019, 'c': 0.53653543888775579, 'b': 0.73170288814512019, 'd': 1})})
Let's verify the result by calculating similarity between, say, node a and node b, denoted by S(a,b).
S(a,b) = r * (S(b,a)+S(b,c)+S(c,a)+S(c,c))/(2*2) = 0.9 * (0.6538+0.6261+0.6261+1)/4 = 0.6538,
which is the same as our calculated S(a,b) above.
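This hand check can be reproduced in a couple of lines of Python (the four similarity values are copied from the output above, rounded to four places):

```python
# Similarity values copied from the computed result above
s_ba, s_bc, s_ca, s_cc = 0.6538, 0.6261, 0.6261, 1.0
r = 0.9

# S(a,b) = r * (sum over neighbor pairs) / (|N(a)| * |N(b)|);
# a and b each have two neighbors in the example graph
s_ab = r * (s_ba + s_bc + s_ca + s_cc) / (2 * 2)
assert abs(s_ab - 0.6538) < 1e-4
```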
For more details, you may want to checkout the following paper:
G. Jeh and J. Widom. SimRank: A Measure of Structural-Context Similarity. In KDD '02, pages 538-543. ACM Press, 2002.
No, simrank is not implemented in networkx.
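(Since this answer was written, NetworkX has gained a built-in implementation: NetworkX 2.4 and later ship networkx.simrank_similarity. A minimal usage sketch, assuming a recent NetworkX:)

```python
import networkx as nx

# Same example graph as in the other answer
G = nx.Graph()
G.add_edges_from([('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')])

# importance_factor plays the role of the decay constant r used
# elsewhere in this thread; the result is a dict of dicts
sim = nx.simrank_similarity(G, importance_factor=0.9)
print(sim['a']['b'])
```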
If you were to add this to networkx, you could shorten the code given by user1036719 by using numpy and itertools:
import itertools
import numpy

def simrank(G, r=0.8, max_iter=100, eps=1e-4):
    nodes = list(G.nodes())
    nodes_i = {node: i for i, node in enumerate(nodes)}
    sim_prev = numpy.zeros((len(nodes), len(nodes)))
    sim = numpy.identity(len(nodes))
    for i in range(max_iter):
        if numpy.allclose(sim, sim_prev, atol=eps):
            break
        sim_prev = numpy.copy(sim)
        for u, v in itertools.product(nodes, nodes):
            if u == v:
                continue
            # list() keeps this working on NetworkX 2.x, where
            # predecessors() returns an iterator rather than a list
            u_ns, v_ns = list(G.predecessors(u)), list(G.predecessors(v))
            # evaluating the similarity of the current pair of nodes
            if len(u_ns) == 0 or len(v_ns) == 0:
                # if a node has no predecessors, set the similarity to zero
                sim[nodes_i[u]][nodes_i[v]] = 0
            else:
                s_uv = sum(sim_prev[nodes_i[u_n]][nodes_i[v_n]]
                           for u_n, v_n in itertools.product(u_ns, v_ns))
                sim[nodes_i[u]][nodes_i[v]] = (r * s_uv) / (len(u_ns) * len(v_ns))
    return sim
Then, taking the toy example from the SimRank paper (the university graph) reproduces the paper's results:
from pprint import pprint

G = networkx.DiGraph()
G.add_edges_from([('1','2'), ('1', '4'), ('2','3'), ('3','1'), ('4', '5'), ('5', '4')])
pprint(simrank(G).round(3))
Which outputs:
array([[ 1. , 0. , 0. , 0.034, 0.132],
[ 0. , 1. , 0. , 0.331, 0.042],
[ 0. , 0. , 1. , 0.106, 0.414],
[ 0.034, 0.331, 0.106, 1. , 0.088],
[ 0.132, 0.042, 0.414, 0.088, 1. ]])