In Pandas, is there a way to efficiently pull out all the MultiIndex indexes present in an HDFStore in table format?
I can select()
efficiently using where=
, but I want all indexes, and none of the columns. I can also select()
using iterator=True
to save RAM, but that still means reading pretty much all the table from disk, so it's still slow.
I have been hunting in the store.root..table.* stuff, hoping that I can get a list of index values. Am I on the right track?
Plan B would be to keep a shorter MultiIndex DataFrame that just contains empty DataFrames appended every time I append the main one. I can retrieve that and get the index much more cheaply than the main one. Inelegant though.
Create a multi-index df
In [35]: df = DataFrame(randn(100000,3),columns=list('ABC'))
In [36]: df['one'] = 'foo'
In [37]: df['two'] = 'bar'
In [38]: df.ix[50000:,'two'] = 'bah'
In [40]: mi = df.set_index(['one','two'])
In [41]: mi
Out[41]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 3 columns):
A 100000 non-null values
B 100000 non-null values
C 100000 non-null values
dtypes: float64(3)
Store it as a table
In [42]: store = pd.HDFStore('test.h5',mode='w')
In [43]: store.append('df',mi)
get_storer
will return the stored object (but not retrieve the data)
In [44]: store.get_storer('df').levels
Out[44]: ['one', 'two']
In [2]: store
Out[2]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable_multi,nrows->100000,ncols->5,indexers->[index],dc->[two,one])
The index levels are created as data_columns, meaning you can use them in selections This is how to select only the index
In [48]: store.select('df',columns=['one'])
Out[48]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Empty DataFrame
To select a single column and return it as a mi-frame
In [49]: store.select('df',columns=['A'])
Out[49]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 1 columns):
A 100000 non-null values
dtypes: float64(1)
To select a single column as a Series (which can also be the index as these are stored as columns). This will be quite fast.
In [2]: store.select_column('df','one')
Out[2]:
0 foo
1 foo
2 foo
3 foo
4 foo
5 foo
6 foo
7 foo
8 foo
9 foo
10 foo
11 foo
12 foo
13 foo
14 foo
...
99985 foo
99986 foo
99987 foo
99988 foo
99989 foo
99990 foo
99991 foo
99992 foo
99993 foo
99994 foo
99995 foo
99996 foo
99997 foo
99998 foo
99999 foo
Length: 100000, dtype: object
If you really want the fastest selection of only the index
In [4]: %timeit store.select_column('df','one')
100 loops, best of 3: 8.71 ms per loop
In [5]: %timeit store.select('df',columns=['one'])
10 loops, best of 3: 43 ms per loop
Or to get a complete index
In [6]: def f():
...: level_1 = store.select_column('df','one')
...: level_2 = store.select_column('df','two')
...: return MultiIndex.from_arrays([ level_1, level_2 ])
...:
In [17]: %timeit f()
10 loops, best of 3: 28.1 ms per loop
If you want the values for each levels, a pretty fast way of doing it
In [2]: store.select_column('df','one').unique()
Out[2]: array(['foo'], dtype=object)
In [3]: store.select_column('df','two').unique()
Out[3]: array(['bar', 'bah'], dtype=object)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With