I have a MultiIndex DataFrame on which I am selecting interesting cross-sections. The code works, but it is slow on large datasets, which makes me think I'm doing something wrong. Essentially I have been concatenating multiple cross-sections into a new DataFrame, and I am looking for a better way.
import pandas as pd
import numpy as np
import itertools
# setup dataset
event = ['event0', 'event1', 'event2']
node = ['n0', 'n1', 'n2', 'n3']
config = ['a', 'b']
data = []
for ev, nd, cfg in itertools.product(event, node, config):
    data.append([ev, nd, cfg, np.random.randn()])
df = pd.DataFrame(data, columns=['event', 'node', 'config', 'value'])
dfi = df.set_index(['event', 'node'])
print(dfi.head(n=12))
which looks like:
config value
event node
event0 n0 a 1.256259
n0 b 0.612465
n1 a 1.593518
n1 b -0.747131
n2 a 0.719973
n2 b 1.063480
n3 a -0.943120
n3 b 2.021804
event1 n0 a -1.427104
n0 b -0.440886
n1 a 0.168212
n1 b -1.084987
I do some analysis which gives me a list of indexes that I care about:
# Find interesting (event,node)
g = df.groupby(['event', 'node'])['value']
gmin = g.min()
idxs = gmin[gmin < -1.2].index
print(idxs)
#idxs = [(u'event1', u'n0'), (u'event1', u'n2'), (u'event2', u'n0')]
Now I only care about those interesting (event, node) combinations. This is the part that is slow on real datasets: each .xs call might take ~100 ms, and they add up:
df2 = pd.concat([dfi.xs(idx) for idx in idxs])
print(df2)
This gives the value for every configuration of each interesting (event, node) cross-section:
config value
event node
event1 n0 a -1.427104
n0 b -0.440886
n2 a 0.273871
n2 b -1.224801
event2 n0 a -1.297496
n0 b -1.087568
One option is to slice the MultiIndex directly by providing multiple indexers. Any of the usual label-based selectors work (see Selection by Label), including slices, lists of labels, single labels, and boolean indexers, and you can use slice(None) to select all the contents of a level.
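For example, here is a minimal sketch (assuming the dfi and idxs computed above) that pulls all the interesting rows with a single .loc call instead of one .xs per tuple:
# Select every interesting (event, node) pair in one label-based lookup.
# `idxs` is the list/Index of (event, node) tuples from the groupby above;
# passing it to .loc returns all matching rows at once.
df2 = dfi.loc[list(idxs)]
print(df2)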
That said, you'll be much better off using groupby's filter method (new in 0.12!), which was designed for exactly this purpose:
In [11]: g = df.groupby(['event', 'node'])
In [12]: g.filter(lambda x: x['value'].min() < -1.2)
Out[12]:
event node config value
0 event0 n0 a -1.566442
1 event0 n0 b -1.652915
14 event1 n3 a 1.685070
15 event1 n3 b -3.205499
20 event2 n2 a -3.007079
21 event2 n2 b 0.159409
(My numbers are different, as they were generated randomly!)
You can then set the index to event and node to get your desired result:
In [13]: g.filter(lambda x: x['value'].min() < -1.2).set_index(['event', 'node'])
Out[13]:
config value
event node
event0 n0 a -1.566442
n0 b -1.652915
event1 n3 a 1.685070
n3 b -3.205499
event2 n2 a -3.007079
n2 b 0.159409