I'm trying to calculate the weighted topological overlap for an adjacency matrix but I cannot figure out how to do it correctly using numpy
. The R
function that does the correct implementation is from WGCNA
(https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity). The formula for computing this (I THINK) is detailed in equation 4 which I believe is correctly reproduced below.
Does anyone know how to implement this correctly so it reflects the WGCNA version?
Yes, I know about rpy2
but I'm trying to go lightweight on this if possible.
For starters, my diagonal is not 1
and the values have no consistent error from the original (e.g. not all off by x
).
When I computed this in R
, I used the following:
> library(WGCNA, quiet=TRUE)
> df_adj = read.csv("https://pastebin.com/raw/sbAZQsE6", row.names=1, header=TRUE, check.names=FALSE, sep="\t")
> df_tom = TOMsimilarity(as.matrix(df_adj), TOMType="unsigned", TOMDenom="min")
# ..connectivity..
# ..matrix multiplication (system BLAS)..
# ..normalization..
# ..done.
# I've uploaded it to this url: https://pastebin.com/raw/HT2gBaZC
I'm not sure where my code is incorrect. The source code for the R
version is here but it's using C
backend scripts? which is very difficult for me interpret.
Here is my implementation in Python
:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
def get_iris_data():
iris = load_iris()
# Iris dataset
X = pd.DataFrame(iris.data,
index = [*map(lambda x:f"iris_{x}", range(150))],
columns = [*map(lambda x: x.split(" (cm)")[0].replace(" ","_"), iris.feature_names)])
y = pd.Series(iris.target,
index = X.index,
name = "Species")
return X, y
# Get data
X, y = get_iris_data()
# Create an adjacency network
# df_adj = np.abs(X.T.corr()) # I've uploaded this part to this url: https://pastebin.com/raw/sbAZQsE6
df_adj = pd.read_csv("https://pastebin.com/raw/sbAZQsE6", sep="\t", index_col=0)
A_adj = df_adj.values
# Correct TOM from WGCNA for the A_adj
# See above for code
# https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity
df_tom__wgcna = pd.read_csv("https://pastebin.com/raw/HT2gBaZC", sep="\t", index_col=0)
# My attempt
A = A_adj.copy()
dimensions = A.shape
assert dimensions[0] == dimensions[1]
d = dimensions[0]
# np.fill_diagonal(A, 0)
# Equation (4) from http://dibernardo.tigem.it/files/papers/2008/zhangbin-statappsgeneticsmolbio.pdf
A_tom = np.zeros_like(A)
for i in range(d):
a_iu = A[i]
k_i = a_iu.sum()
for j in range(i+1, d):
a_ju = A[:,j]
k_j = a_ju.sum()
l_ij = np.dot(a_iu, a_ju)
a_ij = A[i,j]
numerator = l_ij + a_ij
denominator = min(k_i, k_j) + 1 - a_ij
w_ij = numerator/denominator
A_tom[i,j] = w_ij
A_tom = (A_tom + A_tom.T)
There is a package called GTOM
(https://github.com/benmaier/gtom) but it is not for weighted adjacencies. The author of GTOM also took a look at this problem (which a much more sophisticated/efficient NumPy
implementation but it's still not producing the expected results).
Does anyone know how to reproduce the WGCNA implementation?
EDIT: 2019.06.20
I've adapted some of the code from @scleronomic and @benmaier with credits in the doc string. The function is available in soothsayer from v2016.06
and on. Hopefully this will allow people to use topological overlap in Python easier instead of only being able to use R.
https://github.com/jolespin/soothsayer/blob/master/soothsayer/networks/networks.py
import numpy as np
import soothsayer as sy
df_adj = sy.io.read_dataframe("https://pastebin.com/raw/sbAZQsE6")
df_tom = sy.networks.topological_overlap_measure(df_adj)
df_tom__wgcna = sy.io.read_dataframe("https://pastebin.com/raw/HT2gBaZC")
np.allclose(df_tom, df_tom__wgcna)
# True
First let's look at the parts of the equation for the case of a binary adjacency matrix a_ij
:
a_ij
: indicates if node i
is connected to node j
k_i
: count of the neighbors of node i
(connectivity)l_ij
: count of the common neighbors of node i
and node j
so w_ij
measures how many of the neighbors of the node with the lower connectivity are also neighbors of the other node (ie. w_ij
measures "their relative inter-connectedness").
My guess is that they define the diagonal of A to be zero instead of one. With this assumption I can reproduce the values of WGCNA.
A[range(d), range(d)] = 0 # Assumption
L = A @ A # Could be done smarter by using the symmetry
K = A.sum(axis=1)
A_tom = np.zeros_like(A)
for i in range(d):
for j in range(i+1, d):
numerator = L[i, j] + A[i, j]
denominator = min(K[i], K[j]) + 1 - A[i, j]
A_tom[i, j] = numerator / denominator
A_tom += A_tom.T
A_tom[range(d), range(d)] = 1 # Set diagonal to 1 by default
A_tom__wgcna = np.array(pd.read_csv("https://pastebin.com/raw/HT2gBaZC",
sep="\t", index_col=0))
print(np.allclose(A_tom, A_tom__wgcna))
An intuition why the diagonal of A should be zero instead of one can be seen for a simple example with a binary A:
Graph Case Zero Case One
B A B C D A B C D
/ \ A 0 1 1 1 A 1 1 1 1
A-----D B 1 0 0 1 B 1 1 0 1
\ / C 1 0 0 1 C 1 0 1 1
C D 1 1 1 0 D 1 1 1 1
The given description of equation 4 explains:
Note that
w_ij = 1
if the node with fewer connections satisfies two conditions:
- (a) all of its neighbors are also neighbors of the other node and
- (b) it is connected to the other node.
In contrast,
w_ij = 0
ifi
andj
are un-connected and the two nodes do not share any neighbors.
So the connection between A-D should fulfill this criterion and be w_14=1
.
What is still missing when applying the formula is that the diagonal values don't match. I set them to one by default. What is the inter-connectedness of a node with itself anyway? A value different than one (or zero, depending on definition) doesn't make sense to me.
Neither Case Zero nor Case One result in w_ii=1
in the simple example.
In Case Zero it would be necessary that k_i+1 == l_ii
, and in Case One it would be necessary that k_i == l_ii+1
, which both seems wrong to me.
So to summarize I would set the diagonal of the adjacency matrix to zero
, use the given equation and set the diagonal of the result to one
by default.
Given adjacency matrix A, its possible to calculate the TOM matrix W without the use of for loops, which speeds up the process tremendously
L = np.dot(A,A)
k = np.sum(A,axis=0); d = len(k); tile = np.tile(k,(d,1))
K = np.min(np.stack((tile,tile.T),axis=2),axis=2)
W = (L + A)/(K + 1 - A); np.fill_diagonal(W, 1)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With