Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Selecting rows with multi-index columns

Tags:

python

pandas

I have a df with multi-indexed columns, like this:

col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data

enter image description here

I want to be able to select all rows where the values in one of the level 1 columns pass a certain test. If there were no multi-index on the columns I would say something like:

data[data['d']<1]

But of course that fails on a multindex. The level 1 indexes are unique, so I don't want to have to specify the level 0 index, just level 1. I'd like to return the table above but missing row 1, where d>1.

like image 275
Andy Wilson Avatar asked May 20 '20 09:05

Andy Wilson


2 Answers

If values are unique in second level id necessary convert mask from one column DataFrame to Series - possible solution with DataFrame.squeeze:

np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)


print (data.xs('d', axis=1, level=1))
        two
0  1.331864
1  0.953490
2 -0.189313
3  0.064969

print (data.xs('d', axis=1, level=1).squeeze())
0    1.331864
1    0.953490
2   -0.189313
3    0.064969
Name: two, dtype: float64

print (data.xs('d', axis=1, level=1).squeeze().lt(1))
0    False
1     True
2     True
3     True
Name: two, dtype: bool

df = data[data.xs('d', axis=1, level=1).squeeze().lt(1)]

Alternative with DataFrame.iloc:

df = data[data.xs('d', axis=1, level=1).iloc[:, 0].lt(1)]

print (df)
        one                           two                    
          a         b         c         d         e         f
1  0.573761  0.287728 -0.235634  0.953490 -1.689625 -0.344943
2  0.016905 -0.514984  0.244509 -0.189313  2.672172  0.464802
3  0.845930 -0.503542 -0.963336  0.064969 -3.205040  1.054969

If working with MultiIndex after select is possible get multiple columns, like here if select by c level:

np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

So first select by DataFrame.xs and compare by DataFrame.lt for <

print (data.xs('c', axis=1, level=1))
                  two
0  1.481278  0.685609
1 -0.235634 -0.344943
2  0.244509  0.464802
3 -0.963336  1.054969

m = data.xs('c', axis=1, level=1).lt(1)
#alternative
#m = data.xs('c', axis=1, level=1) < 1
print (m)
            two
0  False   True
1   True   True
2   True   True
3   True  False

And then test if at least one True per rows by DataFrame.any and filter by boolean indexing:

df1 = data[m.any(axis=1)]
print (df1)
        one                           two                    
          a         b         c         a         b         c
0 -0.217679  0.821455  1.481278  1.331864 -0.361865  0.685609
1  0.573761  0.287728 -0.235634  0.953490 -1.689625 -0.344943
2  0.016905 -0.514984  0.244509 -0.189313  2.672172  0.464802
3  0.845930 -0.503542 -0.963336  0.064969 -3.205040  1.054969

Or test if all Trues per row by DataFrame.any with filtering:

df1 = data[m.all(axis=1)]
print (df1)
        one                           two                    
          a         b         c         a         b         c
1  0.573761  0.287728 -0.235634  0.953490 -1.689625 -0.344943
2  0.016905 -0.514984  0.244509 -0.189313  2.672172  0.464802
like image 191
jezrael Avatar answered Sep 16 '22 14:09

jezrael


Using your supplied data, a combination of xs and squeeze can help with the filtering. This works on the assumption that the level 1 entries are unique, as indicated in your question :

np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data

       one                                         two
         a          b         c            d            e          f
0   -0.217679   0.821455    1.481278    1.331864    -0.361865   0.685609
1   0.573761    0.287728    -0.235634   0.953490    -1.689625   -0.344943
2   0.016905    -0.514984   0.244509    -0.189313   2.672172    0.464802
3   0.845930    -0.503542   -0.963336   0.064969    -3.205040   1.054969

Say you want to filter for d less than 1 :

#squeeze turns it into a series, making it easy to pass to loc via boolean indexing
condition = data.xs('d',axis=1,level=1).lt(1).squeeze()
#or you could use loc : 
# condition = data.loc(axis=1)[:,'d'].lt(1).squeeze()

data.loc[condition]

        one                                              two
       a             b           c         d            e           f
1   0.573761    0.287728    -0.235634   0.953490    -1.689625   -0.344943
2   0.016905    -0.514984   0.244509    -0.189313   2.672172    0.464802
3   0.845930    -0.503542   -0.963336   0.064969    -3.205040   1.054969
like image 41
sammywemmy Avatar answered Sep 18 '22 14:09

sammywemmy