I am working with a large biological dataset.
I want to calculate the PCC (Pearson's correlation coefficient) for every combination of two columns in my data table and save the result as a DataFrame or a CSV file.
The data table looks like the one below: the columns are gene names and the rows are dataset codes. Each float value shows how strongly the gene is activated in that dataset.
GeneA GeneB GeneC ...
DataA 1.5 2.5 3.5 ...
DataB 5.5 6.5 7.5 ...
DataC 8.5 8.5 8.5 ...
...
As output, I want to build a table (DataFrame or CSV file) like the one below, with two value columns because the scipy.stats.pearsonr function returns (PCC, p-value). In my example, XX and YY are the results of pearsonr([1.5, 5.5, 8.5], [2.5, 6.5, 8.5]). In the same way, ZZ and AA are the results of pearsonr([1.5, 5.5, 8.5], [3.5, 7.5, 8.5]). I do not need redundant pairs such as GeneB_GeneA or GeneC_GeneB.
PCC P-value
GeneA_GeneB XX YY
GeneA_GeneC ZZ AA
GeneB_GeneC BB CC
...
Because there are many columns and rows (over 100) and their names are complicated, typing out column or row names by hand would be difficult.
This might be a simple problem for experts, but I do not know how to handle this kind of table with Python and the pandas library. In particular, creating a new DataFrame and adding the results to it seems very difficult.
Sorry for my poor explanation, but I hope someone could help me.
Use the corr() function to find the correlations among the columns of a DataFrame; its method argument can be 'pearson' (the default), 'kendall' or 'spearman'. In the output DataFrame, the value of each cell is the correlation between the row variable and the column variable. The correlation of a variable with itself is 1.
In R, you can use the form cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and the columns of Y.
In pandas, the corr() function gives the correlation between two columns of a DataFrame.
from pandas import DataFrame
import numpy as np
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path
import itertools
Creating random sample data:
df = DataFrame(np.random.random((5, 5)), columns=['gene_' + chr(i + ord('a')) for i in range(5)])
print(df)
gene_a gene_b gene_c gene_d gene_e
0 0.471257 0.854139 0.781204 0.678567 0.697993
1 0.292909 0.046159 0.250902 0.064004 0.307537
2 0.422265 0.646988 0.084983 0.822375 0.713397
3 0.113963 0.016122 0.227566 0.206324 0.792048
4 0.357331 0.980479 0.157124 0.560889 0.973161
correlations = {}
columns = df.columns.tolist()
for col_a, col_b in itertools.combinations(columns, 2):
    correlations[col_a + '__' + col_b] = pearsonr(df.loc[:, col_a], df.loc[:, col_b])
result = DataFrame.from_dict(correlations, orient='index')
result.columns = ['PCC', 'p-value']
print(result.sort_index())
PCC p-value
gene_a__gene_b 0.461357 0.434142
gene_a__gene_c 0.177936 0.774646
gene_a__gene_d -0.854884 0.064896
gene_a__gene_e -0.155440 0.802887
gene_b__gene_c -0.575056 0.310455
gene_b__gene_d -0.097054 0.876621
gene_b__gene_e 0.061175 0.922159
gene_c__gene_d -0.633302 0.251381
gene_c__gene_e -0.771120 0.126836
gene_d__gene_e 0.531805 0.356315
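In short: take all pairs of DataFrame columns using itertools.combinations(iterable, r), calculate each correlation using scipy.stats.pearsonr, add the results (PCC and p-value) to a dictionary, and build a DataFrame from the dictionary.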
You could then also save the result with result.to_csv(). You might find it convenient to use a MultiIndex (two columns containing the names of each column) instead of the created names for the pairwise correlations.
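A minimal sketch of the MultiIndex variant might look like the following. The level names gene_1 and gene_2, the seeded random sample data, and the variable names are assumptions made for illustration, not part of the answer above.

```python
import itertools

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical sample data: 5 datasets (rows) by 4 genes (columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((5, 4)),
    columns=["gene_a", "gene_b", "gene_c", "gene_d"],
)

# Every unordered pair of columns, each appearing once.
pairs = list(itertools.combinations(df.columns, 2))

# pearsonr returns the coefficient and the p-value for one pair.
rows = []
for col_a, col_b in pairs:
    pcc, p_value = pearsonr(df[col_a], df[col_b])
    rows.append((float(pcc), float(p_value)))

# Two index levels hold the two gene names instead of one joined string.
index = pd.MultiIndex.from_tuples(pairs, names=["gene_1", "gene_2"])
result = pd.DataFrame(rows, index=index, columns=["PCC", "p-value"])

# Without a path, to_csv returns the CSV text; pass a file name to write a file.
csv_text = result.to_csv()
print(csv_text)
```

With the two gene names in separate index levels, a lookup such as result.loc["gene_a"] returns all pairs that start with gene_a, which is harder to do with joined names.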