I would like to read a matrix file something which looks like:
sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
I would like to fetch all the pairs that have a value of > 0.8. E.g: sample1,sample3 0.8 sample2,sample3 0.8 etc in a large file .
When I use csv.reader, each line is turning in to a list and keeping track of row and column names makes program dodgy. I would like to know an elegant way of doing it like using numpy or pandas.
Desired output:
sample1,sample3 0.8
sample2,sample3 0.8
1 can be ignored because between same sample, it will be 1 always.
You can mask out the off upper-triangular values with np.triu:
In [11]: df
Out[11]:
sample1 sample2 sample3
sample
sample1 1.0 0.7 0.8
sample2 0.7 1.0 0.8
sample3 0.8 0.8 1.0
In [12]: np.triu(df, 1)
Out[12]:
array([[ 0. , 0.7, 0.8],
[ 0. , 0. , 0.8],
[ 0. , 0. , 0. ]])
In [13]: np.triu(df, 1) >= 0.8
Out[13]:
array([[False, False, True],
[False, False, True],
[False, False, False]], dtype=bool)
Then to extract the index/columns where it's True I think you have to use np.where*:
In [14]: np.where(np.triu(df, 1) >= 0.8)
Out[14]: (array([0, 1]), array([2, 2]))
This gives you an array of first index indices and then column indices (this is the least efficient part of this numpy version):
In [16]: index, cols = np.where(np.triu(df, 1) >= 0.8)
In [17]: [(df.index[i], df.columns[j], df.iloc[i, j]) for i, j in zip(index, cols)]
Out[17]:
[('sample1', 'sample3', 0.80000000000000004),
('sample2', 'sample3', 0.80000000000000004)]
As desired.
*I may be forgetting an easier way to get this last chunk (Edit: the below pandas code does it, but I think there may be another way too.)
You can use the same trick in pandas but with stack to get the index/columns natively:
In [21]: (np.triu(df, 1) >= 0.8) * df
Out[21]:
sample1 sample2 sample3
sample
sample1 0 0 0.8
sample2 0 0 0.8
sample3 0 0 0.0
In [22]: res = ((np.triu(df, 1) >= 0.8) * df).stack()
In [23]: res
Out[23]:
sample
sample1 sample1 0.0
sample2 0.0
sample3 0.8
sample2 sample1 0.0
sample2 0.0
sample3 0.8
sample3 sample1 0.0
sample2 0.0
sample3 0.0
dtype: float64
In [24]: res[res!=0]
Out[24]:
sample
sample1 sample3 0.8
sample2 sample3 0.8
dtype: float64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With