 

The right way to select multiple cross-sections on a DataFrame

Tags:

pandas

I have a MultiIndex DataFrame from which I am selecting interesting cross-sections. The code works, but it is slow on large datasets, which makes me think I'm doing something wrong. Essentially I have been concatenating multiple cross-sections into a new DataFrame, and I am looking for a better way.

The dataset

import pandas as pd
import numpy as np
import itertools

# set up the example dataset: one random value per (event, node, config) combination
event = ['event0', 'event1', 'event2']
node = ['n0', 'n1', 'n2', 'n3']
config = ['a', 'b']
data = []
for e, n, c in itertools.product(event, node, config):
    data.append([e, n, c, np.random.randn()])
df = pd.DataFrame(data, columns=['event', 'node', 'config', 'value'])
dfi = df.set_index(['event', 'node'])
print(dfi.head(n=12))

which looks like:

            config     value
event  node
event0 n0        a  1.256259
       n0        b  0.612465
       n1        a  1.593518
       n1        b -0.747131
       n2        a  0.719973
       n2        b  1.063480
       n3        a -0.943120
       n3        b  2.021804
event1 n0        a -1.427104
       n0        b -0.440886
       n1        a  0.168212
       n1        b -1.084987

Some Analysis

I do some analysis which gives me a list of indexes that I care about:

# Find the interesting (event, node) pairs: those whose minimum value is below -1.2
g = df.groupby(['event', 'node'])['value']
gmin = g.min()
idxs = gmin[gmin < -1.2].index
print(idxs)
# idxs = [(u'event1', u'n0'), (u'event1', u'n2'), (u'event2', u'n0')]

And the clumsy cross-sections

Now I only care about the interesting (event, node) combinations. This is the part that is slow on real datasets: each .xs call might take 100 ms, but they add up:

df2 = pd.concat([dfi.xs(idx) for idx in idxs])
print(df2)

This gives the value for every configuration of the interesting (event, node) cross-sections:

            config     value
event  node
event1 n0        a -1.427104
       n0        b -0.440886
       n2        a  0.273871
       n2        b -1.224801
event2 n0        a -1.297496
       n0        b -1.087568
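
For comparison, here is a minimal sketch of the same selection done as a single indexing call instead of a loop of .xs calls, assuming a pandas version whose .loc accepts a list of MultiIndex tuples:

# sketch: pass the interesting (event, node) labels to .loc in one call
# (assumes .loc accepts a list of MultiIndex tuples)
df2_alt = dfi.loc[list(idxs)]
print(df2_alt)

This keeps the MultiIndex intact and avoids the per-call overhead of repeated .xs calls.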

References

  • A similar question recommends a Panel. I have not been able to figure out the right indexes to make this work.
asked Aug 22 '13 by chip



1 Answer

You'll be much better off using groupby's filter method (new in 0.12!), which was designed for exactly this purpose:

In [11]: g = df.groupby(['event', 'node'])

In [12]: g.filter(lambda x: x['value'].min() < -1.2)
Out[12]: 
     event node config     value
0   event0   n0      a -1.566442
1   event0   n0      b -1.652915
14  event1   n3      a  1.685070
15  event1   n3      b -3.205499
20  event2   n2      a -3.007079
21  event2   n2      b  0.159409

(My numbers are different, as they were generated randomly!)

You can then set the index to event and node to get your desired result:

In [13]: g.filter(lambda x: x['value'].min() < -1.2).set_index(['event', 'node'])
Out[13]: 
            config     value
event  node                 
event0 n0        a -1.566442
       n0        b -1.652915
event1 n3        a  1.685070
       n3        b -3.205499
event2 n2        a -3.007079
       n2        b  0.159409
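
As a minimal sketch of a variation, assuming a pandas version that supports grouping by index level names, the same filter can be applied directly to the indexed frame dfi, which keeps the MultiIndex without the extra set_index step:

# sketch: group on the index levels of dfi and apply the same filter
# (assumes groupby accepts a list of index level names)
dfi.groupby(level=['event', 'node']).filter(lambda x: x['value'].min() < -1.2)

This should return the same rows as Out[13], with the (event, node) index already in place.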
answered Nov 09 '22 by Andy Hayden