Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to compute the Topological Overlap Measure [TOM] for a weighted adjacency matrix in Python?

I'm trying to calculate the weighted topological overlap for an adjacency matrix but I cannot figure out how to do it correctly using numpy. The R function that does the correct implementation is from WGCNA (https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity). The formula for computing this (I THINK) is detailed in equation 4 which I believe is correctly reproduced below.

enter image description here

Does anyone know how to implement this correctly so it reflects the WGCNA version?

Yes, I know about rpy2 but I'm trying to go lightweight on this if possible.

For starters, my diagonal is not 1 and the values have no consistent error from the original (e.g. not all off by x).

When I computed this in R, I used the following:

> library(WGCNA, quiet=TRUE)
> df_adj = read.csv("https://pastebin.com/raw/sbAZQsE6", row.names=1, header=TRUE, check.names=FALSE, sep="\t")
> df_tom = TOMsimilarity(as.matrix(df_adj), TOMType="unsigned", TOMDenom="min")
# ..connectivity..
# ..matrix multiplication (system BLAS)..
# ..normalization..
# ..done.
# I've uploaded it to this url: https://pastebin.com/raw/HT2gBaZC

I'm not sure where my code is incorrect. The source code for the R version is here but it's using C backend scripts? which is very difficult for me interpret.

Here is my implementation in Python :

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

def get_iris_data():
    iris = load_iris()
    # Iris dataset
    X = pd.DataFrame(iris.data,
                     index = [*map(lambda x:f"iris_{x}", range(150))],
                     columns = [*map(lambda x: x.split(" (cm)")[0].replace(" ","_"), iris.feature_names)])

    y = pd.Series(iris.target,
                           index = X.index,
                           name = "Species")
    return X, y

# Get data
X, y = get_iris_data()

# Create an adjacency network
# df_adj = np.abs(X.T.corr()) # I've uploaded this part to this url: https://pastebin.com/raw/sbAZQsE6
df_adj = pd.read_csv("https://pastebin.com/raw/sbAZQsE6", sep="\t", index_col=0)
A_adj = df_adj.values

# Correct TOM from WGCNA for the A_adj
# See above for code
# https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity
df_tom__wgcna = pd.read_csv("https://pastebin.com/raw/HT2gBaZC", sep="\t", index_col=0)

# My attempt
A = A_adj.copy()
dimensions = A.shape
assert dimensions[0] == dimensions[1]
d = dimensions[0]

# np.fill_diagonal(A, 0)

# Equation (4) from http://dibernardo.tigem.it/files/papers/2008/zhangbin-statappsgeneticsmolbio.pdf
A_tom = np.zeros_like(A)
for i in range(d):
    a_iu = A[i]
    k_i = a_iu.sum()
    for j in range(i+1, d):
        a_ju = A[:,j]
        k_j = a_ju.sum()
        l_ij = np.dot(a_iu, a_ju)
        a_ij = A[i,j]
        numerator = l_ij + a_ij
        denominator = min(k_i, k_j) + 1 - a_ij
        w_ij = numerator/denominator
        A_tom[i,j] = w_ij
A_tom = (A_tom + A_tom.T)

There is a package called GTOM (https://github.com/benmaier/gtom) but it is not for weighted adjacencies. The author of GTOM also took a look at this problem (which a much more sophisticated/efficient NumPy implementation but it's still not producing the expected results).

Does anyone know how to reproduce the WGCNA implementation?

EDIT: 2019.06.20 I've adapted some of the code from @scleronomic and @benmaier with credits in the doc string. The function is available in soothsayer from v2016.06 and on. Hopefully this will allow people to use topological overlap in Python easier instead of only being able to use R.

enter image description here

https://github.com/jolespin/soothsayer/blob/master/soothsayer/networks/networks.py

import numpy as np
import soothsayer as sy
df_adj = sy.io.read_dataframe("https://pastebin.com/raw/sbAZQsE6")
df_tom = sy.networks.topological_overlap_measure(df_adj)
df_tom__wgcna = sy.io.read_dataframe("https://pastebin.com/raw/HT2gBaZC")
np.allclose(df_tom, df_tom__wgcna)
# True
like image 318
O.rka Avatar asked Jun 13 '19 06:06

O.rka


2 Answers

First let's look at the parts of the equation for the case of a binary adjacency matrix a_ij:

  • a_ij: indicates if node i is connected to node j
  • k_i: count of the neighbors of node i (connectivity)
  • l_ij: count of the common neighbors of node i and node j

so w_ij measures how many of the neighbors of the node with the lower connectivity are also neighbors of the other node (ie. w_ij measures "their relative inter-connectedness").

My guess is that they define the diagonal of A to be zero instead of one. With this assumption I can reproduce the values of WGCNA.

A[range(d), range(d)] = 0  # Assumption
L = A @ A  # Could be done smarter by using the symmetry
K = A.sum(axis=1)

A_tom = np.zeros_like(A)
for i in range(d):
    for j in range(i+1, d):  
        numerator = L[i, j] + A[i, j]
        denominator = min(K[i], K[j]) + 1 - A[i, j]
        A_tom[i, j] = numerator / denominator
    
A_tom += A_tom.T
A_tom[range(d), range(d)] = 1  # Set diagonal to 1 by default

A_tom__wgcna = np.array(pd.read_csv("https://pastebin.com/raw/HT2gBaZC", 
                        sep="\t", index_col=0))
print(np.allclose(A_tom, A_tom__wgcna))

An intuition why the diagonal of A should be zero instead of one can be seen for a simple example with a binary A:

 Graph      Case Zero    Case One
   B          A B C D      A B C D  
 /   \      A 0 1 1 1    A 1 1 1 1  
A-----D     B 1 0 0 1    B 1 1 0 1  
 \   /      C 1 0 0 1    C 1 0 1 1  
   C        D 1 1 1 0    D 1 1 1 1  

The given description of equation 4 explains:

Note that w_ij = 1 if the node with fewer connections satisfies two conditions:

  • (a) all of its neighbors are also neighbors of the other node and
  • (b) it is connected to the other node.

In contrast, w_ij = 0 if i and j are un-connected and the two nodes do not share any neighbors.

So the connection between A-D should fulfill this criterion and be w_14=1.

  • Case Zero Diagonal:
  • Case One Diagonal:

What is still missing when applying the formula is that the diagonal values don't match. I set them to one by default. What is the inter-connectedness of a node with itself anyway? A value different than one (or zero, depending on definition) doesn't make sense to me. Neither Case Zero nor Case One result in w_ii=1 in the simple example. In Case Zero it would be necessary that k_i+1 == l_ii, and in Case One it would be necessary that k_i == l_ii+1, which both seems wrong to me.

So to summarize I would set the diagonal of the adjacency matrix to zero, use the given equation and set the diagonal of the result to one by default.

like image 163
scleronomic Avatar answered Oct 12 '22 00:10

scleronomic


Given adjacency matrix A, its possible to calculate the TOM matrix W without the use of for loops, which speeds up the process tremendously

L = np.dot(A,A)
k = np.sum(A,axis=0); d = len(k); tile = np.tile(k,(d,1))
K = np.min(np.stack((tile,tile.T),axis=2),axis=2)
W = (L + A)/(K + 1 - A); np.fill_diagonal(W, 1)
like image 40
Alon Avatar answered Oct 11 '22 23:10

Alon