I'm trying to apply a function to each pair of columns in a numpy array (each column is an individual's genotype).
For example:
In [48]: g[0:10,0:10]
array([[ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1, -1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
       [-1, -1,  0, -1, -1, -1, -1, -1, -1,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1]], dtype=int8)
My goal is to produce a distance matrix d in which each element is the pairwise distance between two columns of g, e.g.
d[0,1] = func(g[:,0], g[:,1])
Any ideas would be fantastic! Thank you!
You can simply define the function as:
import numpy as np

def count_snp_diffs(x, y):
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0), axis=0)
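For instance, applied to a single pair of columns it returns the scalar distance the question asks for. A small sketch with made-up genotype vectors (positions where either call is -1, i.e. missing, are ignored):
x = np.array([ 1, 1, 0, -1,  1], dtype=np.int8)
y = np.array([ 1, 0, 0,  1, -1], dtype=np.int8)
count_snp_diffs(x, y)
# Out
# 1    (only position 1 differs with both calls >= 0)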
Then call it on the whole array at once, indexing g with an array of all possible column pairs generated with itertools.combinations:
import itertools

combinations = np.array(list(itertools.combinations(range(g.shape[1]), 2)))
dist = count_snp_diffs(g[:,combinations[:,0]], g[:,combinations[:,1]])
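For N columns this produces N*(N-1)/2 pairs, and dist is a flat vector whose k-th entry is the distance for the pair combinations[k]. A quick sanity check on the shapes:
N = g.shape[1]
combinations.shape   # (N*(N-1)//2, 2): one row per column pair
dist.shape           # (N*(N-1)//2,): one distance per pair, in the same order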
In addition, if the output must be stored in a matrix (which for large g is not recommended, because only the upper triangle will be filled and the rest is wasted memory), this can be achieved with the same trick:
d = np.zeros((g.shape[1],g.shape[1]))
combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
d[combinations[:,0],combinations[:,1]] = count_snp_diffs(g[:,combinations[:,0]], g[:,combinations[:,1]])
Now, d[i,j] returns the distance between columns i and j (whereas d[j,i] is zero). This approach relies on the fact that arrays can be indexed with lists or arrays containing repeated indices:
a = np.arange(3)+4
a[[0,1,1,1,0,2,1,1]]
# Out
# [4, 5, 5, 5, 4, 6, 5, 5]
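If a fully symmetric matrix is more convenient than just the upper triangle, it can simply be mirrored afterwards, since the diagonal is zero (a small addition on top of the approach above):
d = d + d.T   # copy the upper triangle into the lower triangle; the diagonal stays 0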
Here is a step-by-step explanation of what is happening. Calling g[:,combinations[:,0]] accesses all the columns listed in the first column of combinations, generating a new array, which is compared column by column with the array generated by g[:,combinations[:,1]]. Thus, a boolean array diff is generated. If g had 3 columns it would look like this, where each column is the comparison of the column pairs 0,1, 0,2 and 1,2:
[[ True False False]
[False True False]
[ True True False]
[False False False]
[False True False]
[False False False]]
And finally, the values for each column are added:
np.count_nonzero(diff,axis=0)
# Out
# [2 3 0]
In addition, because the boolean class in Python inherits from the integer class (roughly, False == 0 and True == 1; see this answer to "Is False == 0 and True == 1 in Python an implementation detail or is it guaranteed by the language?" for more info), np.count_nonzero adds 1 for each True position, which is the same result obtained with np.sum:
np.sum(diff,axis=0)
# Out
# [2 3 0]
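A tiny illustration of that bool/int behaviour:
np.sum(np.array([True, False, True, True]))
# Out
# 3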
For large arrays, working with the whole array at once can require too much memory and you can get a MemoryError; however, for small or medium arrays it tends to be the fastest approach. In some cases it can be useful to work in chunks:
combinations = np.array(list(itertools.combinations(range(g.shape[1]), 2)))
n = len(combinations)
dist = np.empty(n)
# B = np.zeros((g.shape[1],g.shape[1]))
chunk = 200
i = 0  # so the final step below also works when n <= chunk
for i in xrange(chunk, n, chunk):  # use range instead of xrange on Python 3
    dist[i-chunk:i] = count_snp_diffs(g[:,combinations[i-chunk:i,0]], g[:,combinations[i-chunk:i,1]])
    # B[combinations[i-chunk:i,0],combinations[i-chunk:i,1]] = count_snp_diffs(g[:,combinations[i-chunk:i,0]], g[:,combinations[i-chunk:i,1]])
dist[i:] = count_snp_diffs(g[:,combinations[i:,0]], g[:,combinations[i:,1]])
# B[combinations[i:,0],combinations[i:,1]] = count_snp_diffs(g[:,combinations[i:,0]], g[:,combinations[i:,1]])
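Since itertools.combinations enumerates pairs in the same order scipy uses for condensed distance vectors, dist can also be expanded into a full symmetric matrix afterwards; a sketch using scipy (an extra convenience, not part of the original approach):
from scipy.spatial.distance import squareform

d_full = squareform(dist)   # (N, N) symmetric matrix with zeros on the diagonal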
For g.shape=(300,N), the execution times reported by %%timeit on my computer (Python 2.7, numpy 1.14.2, allel 1.1.10) show that, with these array dimensions, pure numpy is a little faster than the allel module, but the computation time should be checked for the dimensions in your problem.
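If you want to compare approaches for your own dimensions, a simple reference implementation with scipy.spatial.distance.pdist and the same metric is handy; it calls the metric once per pair from Python, so it is usually slower, but it should give the same distances. A sketch, assuming g and count_snp_diffs as defined above:
from scipy.spatial.distance import pdist

dist_ref = pdist(g.T, metric=count_snp_diffs)   # pdist treats rows as observations, hence g.T
np.allclose(dist_ref, dist)   # expected to be True: both use the same pair ordering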