I have two CSV files with hundreds of columns each, and I want to calculate the Pearson correlation coefficient and p-value for each pair of matching columns in the two files. The problem is that when a column contains missing data ("NaN"), I get an error. If I remove the NaN values with ".dropna", the shapes of X and Y sometimes no longer match (because a different number of rows was dropped from each), and I receive this error:
"ValueError: operands could not be broadcast together with shapes (1020,) (1016,)"
Question: If row #8 in one CSV is "nan", is there a way to remove the same row from the other CSV as well, so that the analysis for every column uses only the rows that have values in both CSV files?
import pandas as pd
import scipy
import csv
import numpy as np
from scipy import stats

df = pd.read_csv("D:/Insitu-Daily.csv", header=None)
dg = pd.read_csv("D:/Model-Daily.csv", header=None)

pearson_corr_set = []
pearson_p_set = []
for i in range(1, df.shape[1]):
    X = df[i].dropna(axis=0, how='any')
    Y = dg[i].dropna(axis=0, how='any')
    [pearson_corr, pearson_p] = scipy.stats.stats.pearsonr(X, Y)
    pearson_corr_set = np.append(pearson_corr_set, pearson_corr)
    pearson_p_set = np.append(pearson_p_set, pearson_p)

with open('D:/Results.csv', 'wb') as file:
    str1 = ",".join(str(i) for i in np.asarray(pearson_corr_set))
    file.write(str1)
    file.write('\n')
    str1 = ",".join(str(i) for i in np.asarray(pearson_p_set))
    file.write(str1)
    file.write('\n')
Instead of dropna, try using isnan and boolean indexing:
for i in range(1, df.shape[1]):
    df_sub = df[i]
    dg_sub = dg[i]
    mask = ~np.isnan(df_sub) & ~np.isnan(dg_sub)
    # mask is True only for rows where neither df nor dg is NaN in column i.
    X = df_sub[mask]  # a Series of length mask.sum()
    Y = dg_sub[mask]  # same rows taken from the other file, so the lengths match
    ... your code continues.
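For reference, here is a minimal, self-contained sketch of the whole loop with the mask applied. The helper name `paired_pearson`, its `start_col` parameter, and the small synthetic frames are assumptions made for illustration only; they stand in for the two CSV files read with `header=None`, and the sketch assumes the columns are numeric.

```python
import numpy as np
import pandas as pd
from scipy import stats


def paired_pearson(df, dg, start_col=1):
    """Hypothetical helper: Pearson r and p-value per column,
    using only the rows that are non-NaN in both frames."""
    corrs, pvals = [], []
    for i in range(start_col, df.shape[1]):
        x = df[i]
        y = dg[i]
        # Keep a row only when neither frame has NaN in this column.
        mask = ~np.isnan(x) & ~np.isnan(y)
        r, p = stats.pearsonr(x[mask], y[mask])
        corrs.append(r)
        pvals.append(p)
    return corrs, pvals


# Synthetic stand-ins for the two CSV files (integer column labels,
# as produced by header=None). Column 0 is skipped, as in the question.
df = pd.DataFrame({0: [1, 2, 3, 4, 5],
                   1: [1.0, 2.0, np.nan, 4.0, 5.0],
                   2: [2.0, 1.0, 4.0, 3.0, 6.0]})
dg = pd.DataFrame({0: [1, 2, 3, 4, 5],
                   1: [2.0, 4.0, 6.0, np.nan, 10.0],
                   2: [1.0, 3.0, 2.0, 5.0, 4.0]})

corrs, pvals = paired_pearson(df, dg)

# Two rows (correlations, then p-values), mirroring the layout
# the question writes to Results.csv.
results = pd.DataFrame([corrs, pvals])
print(results)
```

In column 1, row 2 is NaN in one frame and row 3 is NaN in the other, so both rows are dropped from both series before the correlation is computed. If you want the output in a file, `results.to_csv(..., header=False, index=False)` should produce the same two-line layout as the original writing code.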
Hope that helps!