Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas HDFStore of MultiIndex DataFrames: how to efficiently get all indexes

In Pandas, is there a way to efficiently pull out all the MultiIndex indexes present in an HDFStore in table format?

I can select() efficiently using where=, but I want all indexes, and none of the columns. I can also select() using iterator=True to save RAM, but that still means reading pretty much all the table from disk, so it's still slow.

I have been hunting in the store.root..table.* stuff, hoping that I can get a list of index values. Am I on the right track?

Plan B would be to keep a shorter MultiIndex DataFrame that just contains empty DataFrames appended every time I append the main one. I can retrieve that and get the index much more cheaply than the main one. Inelegant though.

like image 800
Tony Avatar asked Jul 15 '13 10:07

Tony


1 Answers

Create a multi-index df

In [35]: df = DataFrame(randn(100000,3),columns=list('ABC'))

In [36]: df['one'] = 'foo'

In [37]: df['two'] = 'bar'

In [38]: df.ix[50000:,'two'] = 'bah'

In [40]: mi = df.set_index(['one','two'])

In [41]: mi
Out[41]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 3 columns):
A    100000  non-null values
B    100000  non-null values
C    100000  non-null values
dtypes: float64(3)

Store it as a table

In [42]: store = pd.HDFStore('test.h5',mode='w')

In [43]: store.append('df',mi)

get_storer will return the stored object (but not retrieve the data)

In [44]: store.get_storer('df').levels
Out[44]: ['one', 'two']

In [2]: store
Out[2]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable_multi,nrows->100000,ncols->5,indexers->[index],dc->[two,one])

The index levels are created as data_columns, meaning you can use them in selections This is how to select only the index

In [48]: store.select('df',columns=['one'])
Out[48]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Empty DataFrame

To select a single column and return it as a mi-frame

In [49]: store.select('df',columns=['A'])
Out[49]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 1 columns):
A    100000  non-null values
dtypes: float64(1)

To select a single column as a Series (which can also be the index as these are stored as columns). This will be quite fast.

In [2]: store.select_column('df','one')
Out[2]: 
0     foo
1     foo
2     foo
3     foo
4     foo
5     foo
6     foo
7     foo
8     foo
9     foo
10    foo
11    foo
12    foo
13    foo
14    foo
...
99985    foo
99986    foo
99987    foo
99988    foo
99989    foo
99990    foo
99991    foo
99992    foo
99993    foo
99994    foo
99995    foo
99996    foo
99997    foo
99998    foo
99999    foo
Length: 100000, dtype: object

If you really want the fastest selection of only the index

In [4]: %timeit store.select_column('df','one')
100 loops, best of 3: 8.71 ms per loop

In [5]: %timeit store.select('df',columns=['one'])
10 loops, best of 3: 43 ms per loop

Or to get a complete index

In [6]: def f():
   ...:     level_1 =  store.select_column('df','one')
   ...:     level_2 =  store.select_column('df','two')
   ...:     return MultiIndex.from_arrays([ level_1, level_2 ])
   ...: 

In [17]: %timeit f()
10 loops, best of 3: 28.1 ms per loop

If you want the values for each levels, a pretty fast way of doing it

In [2]: store.select_column('df','one').unique()
Out[2]: array(['foo'], dtype=object)

In [3]: store.select_column('df','two').unique()
Out[3]: array(['bar', 'bah'], dtype=object)
like image 187
Jeff Avatar answered Sep 20 '22 13:09

Jeff