I have a problem where I have huge dataset like below (Correl Coef matrix)
A B C D E
A 1, 0.413454352,0.615350574,0.479720098,0.34261232
B 0.413454352,1, 0.568124328,0.316543449,0.361164436
C 0.615350574,0.568124328,1, 0.633182519,0.790921334
D 0.479720098,0.316543449,0.633182519,1, 0.450248008
E 0.34261232, 0.361164436,0.790921334,0.450248008,1
I want to fetch all the values in this data frame where cell value is greater than 0.6 it should be along with row name and column name like below
row_name col_name value
1 A C 0.61
2 C A 0.61
3 C D 0.63
3 C E 0.79
4 D C 0.63
5 E C 0.79
If we can also ignore either (A,C) or (C,A) ..it would be much better.
I know I can do it using for loop but that method is not efficient for large data set.
Here's a NumPy based approach -
# Extract values and row, column names
arr = df.values
index_names = df.index
col_names = df.columns
# Get indices where such threshold is crossed; avoid diagonal elems
R,C = np.where(np.triu(arr,1)>0.6)
# Arrange those in columns and put out as a dataframe
out_arr = np.column_stack((index_names[R],col_names[C],arr[R,C]))
df_out = pd.DataFrame(out_arr,columns=[['row_name','col_name','value']])
Sample run -
In [139]: df
Out[139]:
A B C D E
P 1.000000 0.031388 0.263606 0.121490 0.628969
Q 0.031388 1.000000 0.963510 0.497828 0.955238
R 0.263606 0.963510 1.000000 0.917935 0.520522
S 0.121490 0.497828 0.917935 1.000000 0.728386
T 0.628969 0.955238 0.520522 0.728386 1.000000
In [140]: df_out
Out[140]:
row_name col_name value
0 P E 0.628969
1 Q C 0.96351
2 Q E 0.955238
3 R D 0.917935
4 S E 0.728386
UPDATE: using @Divakar's solution and his hints:
In [186]: df = pd.DataFrame(np.triu(df, 1), columns=df.columns, index=df.index)
In [187]: df
Out[187]:
A B C D E
A 0.0 0.413454 0.615351 0.479720 0.342612
B 0.0 0.000000 0.568124 0.316543 0.361164
C 0.0 0.000000 0.000000 0.633183 0.790921
D 0.0 0.000000 0.000000 0.000000 0.450248
E 0.0 0.000000 0.000000 0.000000 0.000000
In [188]: df[df > 0.6].stack().reset_index()
Out[188]:
level_0 level_1 0
0 A C 0.615351
1 C D 0.633183
2 C E 0.790921
OLD answer:
In [96]: df[df > 0.6]
Out[96]:
A B C D E
A 1.000000 NaN 0.615351 NaN NaN
B NaN 1.0 NaN NaN NaN
C 0.615351 NaN 1.000000 0.633183 0.790921
D NaN NaN 0.633183 1.000000 NaN
E NaN NaN 0.790921 NaN 1.000000
In [97]: df[df > 0.6].stack()
Out[97]:
A A 1.000000
C 0.615351
B B 1.000000
C A 0.615351
C 1.000000
D 0.633183
E 0.790921
D C 0.633183
D 1.000000
E C 0.790921
E 1.000000
dtype: float64
or:
In [99]: df[df > 0.6].stack().reset_index()
Out[99]:
level_0 level_1 0
0 A A 1.000000
1 A C 0.615351
2 B B 1.000000
3 C A 0.615351
4 C C 1.000000
5 C D 0.633183
6 C E 0.790921
7 D C 0.633183
8 D D 1.000000
9 E C 0.790921
10 E E 1.000000
data set:
In [100]: df
Out[100]:
A B C D E
A 1.000000 0.413454 0.615351 0.479720 0.342612
B 0.413454 1.000000 0.568124 0.316543 0.361164
C 0.615351 0.568124 1.000000 0.633183 0.790921
D 0.479720 0.316543 0.633183 1.000000 0.450248
E 0.342612 0.361164 0.790921 0.450248 1.000000
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With