Calculating pairwise correlation among all columns

Tags: python, pandas, correlation

I am working with a large biological dataset.

I want to calculate the PCC (Pearson's correlation coefficient) for every two-column combination in my data table and save the result as a DataFrame or CSV file.

The data table looks like the one below: columns are gene names, and rows are dataset codes. The float values indicate how strongly each gene is activated in that dataset.

      GeneA  GeneB  GeneC  ...
DataA   1.5    2.5    3.5  ...
DataB   5.5    6.5    7.5  ...
DataC   8.5    8.5    8.5  ...
...

As output, I want to build a table (DataFrame or CSV file) like the one below, because the scipy.stats.pearsonr function returns (PCC, p-value). In my example, XX and YY are the results of pearsonr([1.5, 5.5, 8.5], [2.5, 6.5, 8.5]). In the same way, ZZ and AA are the results of pearsonr([1.5, 5.5, 8.5], [3.5, 7.5, 8.5]). I do not need redundant pairs such as GeneB_GeneA or GeneC_GeneB for my test.

               PCC  P-value
GeneA_GeneB    XX   YY
GeneA_GeneC    ZZ   AA
GeneB_GeneC    BB   CC
...

Because there are many columns and rows (over 100) and their names are complicated, referring to columns or rows by name will be difficult.

This might be a simple problem for experts, but I do not know how to handle this kind of table with Python and the pandas library. In particular, creating a new DataFrame and adding the results to it seems very difficult.

Sorry for my poor explanation, but I hope someone can help me.

360

asked Nov 30 '15 11:11

z991

1 Answer

from pandas import DataFrame
import numpy as np
from scipy.stats import pearsonr
import itertools

Creating random sample data:

df = DataFrame(np.random.random((5, 5)), columns=['gene_' + chr(i + ord('a')) for i in range(5)]) 
print(df)

     gene_a    gene_b    gene_c    gene_d    gene_e
0  0.471257  0.854139  0.781204  0.678567  0.697993
1  0.292909  0.046159  0.250902  0.064004  0.307537
2  0.422265  0.646988  0.084983  0.822375  0.713397
3  0.113963  0.016122  0.227566  0.206324  0.792048
4  0.357331  0.980479  0.157124  0.560889  0.973161

correlations = {}
columns = df.columns.tolist()

for col_a, col_b in itertools.combinations(columns, 2):
    correlations[col_a + '__' + col_b] = pearsonr(df.loc[:, col_a], df.loc[:, col_b])

result = DataFrame.from_dict(correlations, orient='index')
result.columns = ['PCC', 'p-value']

print(result.sort_index())

                     PCC   p-value
gene_a__gene_b  0.461357  0.434142
gene_a__gene_c  0.177936  0.774646
gene_a__gene_d -0.854884  0.064896
gene_a__gene_e -0.155440  0.802887
gene_b__gene_c -0.575056  0.310455
gene_b__gene_d -0.097054  0.876621
gene_b__gene_e  0.061175  0.922159
gene_c__gene_d -0.633302  0.251381
gene_c__gene_e -0.771120  0.126836
gene_d__gene_e  0.531805  0.356315

1. Get the unique combinations of DataFrame columns using itertools.combinations(iterable, r).
2. Iterate through these combinations and calculate the pairwise correlations using scipy.stats.pearsonr.
3. Add the results (PCC and p-value tuple) to a dictionary.
4. Build a DataFrame from the dictionary.
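The steps above can be sketched as a small function and applied to the sample table from the question. This is only a sketch: the function name pairwise_pearson and the single-underscore pair labels are my own choices for illustration, not something pandas or SciPy provide. Note that the column names are never typed by hand; they are taken from df.columns, so long or complicated gene names are not a problem.

```python
import itertools

import pandas as pd
from scipy.stats import pearsonr


def pairwise_pearson(df):
    """Return one row per unordered column pair with its PCC and p-value.

    Hypothetical helper for illustration only.
    """
    rows = {}
    # combinations() yields (A, B) but never (B, A), so no redundant pairs.
    for col_a, col_b in itertools.combinations(df.columns, 2):
        r, p = pearsonr(df[col_a], df[col_b])
        rows[f"{col_a}_{col_b}"] = (float(r), float(p))
    return pd.DataFrame.from_dict(rows, orient="index", columns=["PCC", "P-value"])


# Sample table from the question: rows are datasets, columns are genes.
data = pd.DataFrame(
    {"GeneA": [1.5, 5.5, 8.5], "GeneB": [2.5, 6.5, 8.5], "GeneC": [3.5, 7.5, 8.5]},
    index=["DataA", "DataB", "DataC"],
)

pairs = pairwise_pearson(data)
print(pairs)
```

With n columns this computes n * (n - 1) / 2 pairs, so a few hundred genes should still be manageable with a plain loop.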

You could then save the result with result.to_csv(). You might also find it convenient to use a MultiIndex (two index levels holding the two column names of each pair) instead of the concatenated names for the pairwise correlations.
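A minimal sketch of the MultiIndex variant, assuming random sample data as above. The level names gene_1 and gene_2 are made up for this example. The CSV is written to an in-memory buffer here so the snippet has no side effects; passing a file path to to_csv instead would write to disk.

```python
import io
import itertools

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Random sample data; the seed only makes the example repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 4)), columns=["gene_a", "gene_b", "gene_c", "gene_d"])

# Collect one record per unordered column pair.
records = []
for col_a, col_b in itertools.combinations(df.columns, 2):
    r, p = pearsonr(df[col_a], df[col_b])
    records.append((col_a, col_b, float(r), float(p)))

# Use the two column names as a two-level index instead of a joined string.
result = pd.DataFrame(records, columns=["gene_1", "gene_2", "PCC", "p-value"])
result = result.set_index(["gene_1", "gene_2"])

# Look up a single pair, or every pair whose first gene is gene_a.
print(result.loc[("gene_a", "gene_c")])
print(result.loc["gene_a"])

# to_csv writes both index levels as the leading columns.
buffer = io.StringIO()
result.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the two names in separate index levels avoids having to split a joined label later, which matters if the gene names themselves contain underscores.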

107

answered Sep 17 '22 14:09

Stefan
