I have a square correlation matrix in pandas, and am trying to divine the most efficient way to return all values where the value (always a float -1 <= x <= 1) is above a certain threshold.
The pandas.DataFrame.filter method asks for a list of columns or a RegEx, but I always want to pass all columns in. Is there a best practice on this?
Not sure what your desired output is since you didn't provide a sample, but I'll give you my two cents on what I would do:
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,5))
corr = df.corr()
corr.shape
Out[1]: (5, 5)
Now, let's extract the upper triangle of the correlation matrix (it's symmetric), excluding the diagonal. For this we are going to use np.tril, cast the result to boolean, and invert it with the ~ operator.
In [2]: corr_triu = corr.where(~np.tril(np.ones(corr.shape)).astype(bool))
corr_triu
Out[2]:
0 1 2 3 4
0 NaN 0.228763 -0.276406 0.286771 -0.050825
1 NaN NaN -0.562459 -0.596057 0.540656
2 NaN NaN NaN 0.402752 0.042400
3 NaN NaN NaN NaN -0.642285
4 NaN NaN NaN NaN NaN
Now let's stack this and filter all values that are above 0.3, for example:
In [3]: corr_triu = corr_triu.stack()
corr_triu[corr_triu > 0.3]
Out[3]:
1 4 0.540656
2 3 0.402752
dtype: float64
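As an aside, you can build the same mask directly with np.triu and k=1 (which keeps only entries strictly above the diagonal), saving the ~ inversion; a minimal equivalent sketch:
# Same idea spelled with np.triu: k=1 excludes the diagonal, so no ~ needed
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
alt_triu = corr.where(mask).stack()  # identical to corr_triu above
alt_triu[alt_triu > 0.3]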
If you want to make it a bit prettier:
In [4]: corr_triu.name = 'Pearson Correlation Coefficient'
corr_triu.index.names = ['Col1', 'Col2']
In [5]: corr_triu[corr_triu > 0.3].to_frame()
Out[5]:
Pearson Correlation Coefficient
Col1 Col2
1 4 0.540656
2 3 0.402752
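If you do this regularly, the steps above fold naturally into a small helper; a minimal sketch, assuming the pandas/numpy imports from earlier (the function name get_high_correlations is mine, not from the original):
# Hypothetical helper wrapping the upper-triangle + stack + filter recipe
def get_high_correlations(df, threshold=0.3):
    corr = df.corr()
    # strict upper triangle: each pair appears once, diagonal excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold]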
There are two ways to go about this:
Suppose:
In [8]: a = np.array([1,2,3,4,6,7,8,9])
In [9]: b = np.array([2,4,6,8,10,12,13,15])
In [10]: c = np.array([-1,-2,-2,-3,-4,-6,-7,-8])
In [11]: corr = np.corrcoef([a,b,c])
In [12]: df = pd.DataFrame(corr)
In [13]: df
Out[13]:
0 1 2
0 1.000000 0.995350 -0.980521
1 0.995350 1.000000 -0.971724
2 -0.980521 -0.971724 1.000000
Then you can simply:
In [14]: df > 0.5
Out[14]:
0 1 2
0 True True False
1 True True False
2 False False True
In [15]: df[df > 0.5]
Out[15]:
0 1 2
0 1.00000 0.99535 NaN
1 0.99535 1.00000 NaN
2 NaN NaN 1.0
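If the NaN padding bothers you, one optional tidy-up (my addition, not in the original answer) is to drop rows and columns that became entirely NaN:
# Drop rows/columns that are all-NaN after masking; on this 3x3 example
# nothing is removed, but on larger matrices it can shrink the view a lot
df[df > 0.5].dropna(axis=0, how='all').dropna(axis=1, how='all')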
If you want only the values, the easiest way is to work with the underlying numpy data structures via the values attribute:
In [17]: df.values
Out[17]:
array([[ 1. , 0.99535001, -0.9805214 ],
[ 0.99535001, 1. , -0.97172394],
[-0.9805214 , -0.97172394, 1. ]])
In [18]: df.values[(df > 0.5).values]
Out[18]: array([ 1. , 0.99535001, 0.99535001, 1. , 1. ])
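Note that this flat array still carries the diagonal 1.0s and each off-diagonal value twice, because the matrix is symmetric. If that's unwanted, a hedged variant using np.triu_indices_from (my suggestion, not part of the original answer):
# Take only entries strictly above the diagonal (k=1): each pair appears
# once and the self-correlations are gone
iu = np.triu_indices_from(df.values, k=1)
upper = df.values[iu]
upper[upper > 0.5]  # array([ 0.99535001]) for this example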
Instead of .values, as pointed out by ayhan, you can use stack, which automatically drops NaN and also keeps labels...
In [22]: df.index = ['a','b','c']
In [23]: df.columns=['a','b','c']
In [24]: df
Out[24]:
a b c
a 1.000000 0.995350 -0.980521
b 0.995350 1.000000 -0.971724
c -0.980521 -0.971724 1.000000
In [25]: df.stack() > 0.5
Out[25]:
a a True
b True
c False
b a True
b True
c False
c a False
b False
c True
dtype: bool
In [26]: df.stack()[df.stack() > 0.5]
Out[26]:
a a 1.00000
b 0.99535
b a 0.99535
b 1.00000
c c 1.00000
dtype: float64
You can always go back...
In [29]: (df.stack()[df.stack() > 0.5]).unstack()
Out[29]:
a b c
a 1.00000 0.99535 NaN
b 0.99535 1.00000 NaN
c NaN NaN 1.0
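One caveat with the stacked approach: the diagonal pairs (a-a, b-b, c-c) always clear the threshold, since a series correlates perfectly with itself. A hedged way to drop them via the MultiIndex:
# Filter out self-pairs by comparing the two index levels
s = df.stack()
s[(s > 0.5) & (s.index.get_level_values(0) != s.index.get_level_values(1))]
This still keeps both orderings of each pair (a-b and b-a) because the full symmetric matrix was stacked; the upper-triangle mask from the first answer is what removes that duplication.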