Reading a matrix and fetching row and column names in Python

Question

I would like to read a matrix file that looks like this:

sample  sample1 sample2 sample3
sample1 1   0.7 0.8
sample2 0.7 1   0.8
sample3 0.8 0.8 1

I would like to fetch all the pairs that have a value of at least 0.8 (>= 0.8), e.g. sample1,sample3 0.8 and sample2,sample3 0.8, from a large file.

When I use csv.reader, each line turns into a list, and keeping track of the row and column names makes the program clumsy. I would like to know an elegant way of doing this, for example with numpy or pandas.

Desired output:

sample1,sample3 0.8 
sample2,sample3 0.8

The 1 values can be ignored, because the value between a sample and itself is always 1.

Andy Hayden · Accepted Answer

You can zero out the diagonal and everything below it with np.triu, keeping only the values strictly above the diagonal:

In [11]: df
Out[11]:
         sample1  sample2  sample3
sample
sample1      1.0      0.7      0.8
sample2      0.7      1.0      0.8
sample3      0.8      0.8      1.0
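
The answer does not show how df was built. A minimal sketch of one way to load it, assuming the file is whitespace-delimited with the header row shown in the question (the inline text stands in for a real file path, and the code is written for Python 3 rather than the python-2.7 of the question's tags):

```python
import io

import pandas as pd

# Inline copy of the question's matrix; in practice pass a file path
# to read_csv instead of this StringIO object.
text = """sample  sample1 sample2 sample3
sample1 1   0.7 0.8
sample2 0.7 1   0.8
sample3 0.8 0.8 1
"""

# sep=r"\s+" splits on any run of whitespace; index_col=0 turns the
# first column ("sample") into the row labels.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col=0)
print(df)
```

With the row labels in the index and the column labels in df.columns, there is no need to track names by hand as with csv.reader.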

In [12]: np.triu(df, 1)
Out[12]:
array([[ 0. ,  0.7,  0.8],
       [ 0. ,  0. ,  0.8],
       [ 0. ,  0. ,  0. ]])

In [13]: np.triu(df, 1) >= 0.8
Out[13]:
array([[False, False,  True],
       [False, False,  True],
       [False, False, False]], dtype=bool)

Then, to extract the row/column positions where it's True, I think you have to use np.where*:

In [14]: np.where(np.triu(df, 1) >= 0.8)
Out[14]: (array([0, 1]), array([2, 2]))

This gives you an array of row positions and then an array of column positions (this is the least efficient part of this numpy version):

In [16]: index, cols = np.where(np.triu(df, 1) >= 0.8)

In [17]: [(df.index[i], df.columns[j], df.iloc[i, j]) for i, j in zip(index, cols)]
Out[17]:
[('sample1', 'sample3', 0.80000000000000004),
 ('sample2', 'sample3', 0.80000000000000004)]

As desired.
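
The steps from In [12] to In [17] can be wrapped into one helper. This is a minimal sketch: pairs_at_least is a hypothetical name, not part of numpy or pandas, and it assumes a positive threshold, because np.triu fills the masked positions with 0 and those would otherwise match.

```python
import numpy as np
import pandas as pd


def pairs_at_least(df, threshold):
    """Return (row, column, value) for entries above the diagonal >= threshold.

    Assumes threshold > 0, since np.triu sets masked entries to 0.
    """
    values = df.to_numpy()
    # k=1 zeroes the diagonal and everything below it.
    rows, cols = np.where(np.triu(values, k=1) >= threshold)
    # Map integer positions back to the row and column labels.
    return [(df.index[i], df.columns[j], float(values[i, j]))
            for i, j in zip(rows, cols)]


names = ["sample1", "sample2", "sample3"]
df = pd.DataFrame([[1.0, 0.7, 0.8],
                   [0.7, 1.0, 0.8],
                   [0.8, 0.8, 1.0]], index=names, columns=names)

for row, col, value in pairs_at_least(df, 0.8):
    print("{},{} {}".format(row, col, value))
```

The loop prints the pairs in the format the question asks for, one pair per line.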

*I may be forgetting an easier way to get this last chunk. (Edit: the pandas code below does it, but I think there may be another way too.)

You can use the same trick in pandas but with stack to get the index/columns natively:

In [21]: (np.triu(df, 1) >= 0.8) * df
Out[21]:
         sample1  sample2  sample3
sample
sample1        0        0      0.8
sample2        0        0      0.8
sample3        0        0      0.0

In [22]: res = ((np.triu(df, 1) >= 0.8) * df).stack()

In [23]: res
Out[23]:
sample
sample1  sample1    0.0
         sample2    0.0
         sample3    0.8
sample2  sample1    0.0
         sample2    0.0
         sample3    0.8
sample3  sample1    0.0
         sample2    0.0
         sample3    0.0
dtype: float64

In [24]: res[res!=0]
Out[24]:
sample
sample1  sample3    0.8
sample2  sample3    0.8
dtype: float64
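
One caveat with res[res!=0] is that it would also drop a genuine 0 value above the diagonal if the threshold allowed one. A variation that avoids this, offered as a sketch rather than as part of the original answer, masks with NaN via DataFrame.where instead of multiplying by the boolean array:

```python
import numpy as np
import pandas as pd

names = ["sample1", "sample2", "sample3"]
df = pd.DataFrame([[1.0, 0.7, 0.8],
                   [0.7, 1.0, 0.8],
                   [0.8, 0.8, 1.0]], index=names, columns=names)

# Boolean mask that is True strictly above the diagonal.
upper = np.triu(np.ones(df.shape, dtype=bool), k=1)

# where() keeps values where the mask is True and puts NaN elsewhere;
# stack() turns the frame into a Series indexed by (row, column).
stacked = df.where(upper).stack()

# NaN compares False against the threshold, so masked entries drop out
# here whether or not stack() already removed them.
result = stacked[stacked >= 0.8]
print(result)
```

The resulting Series has the same (row, column) MultiIndex as res above, so result.index yields the sample pairs directly.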

Tags: pandas, numpy, python-2.7

gthm
