Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ptrepack sortby needs 'full' index

I am trying to ptrepack a HDF file that was created with pandas HDFStore pytables interface. The main index of the dataframe was time but I made some more columns data_columns so that I can filter for data on-disk via these data_columns.

Now I would like to sort the HDF file by one of those columns (because the selection is too slow for my taste, 84 GB file), using ptrepack with the sortby option like so:

()[maye@luna4 .../nominal]$ ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc --sortby=clat C9.h5 C9_sorted.h5

and I get the error message:

()[maye@luna4 .../nominal]$ Problems doing the copy from 'C9.h5:/' to 'C9_sorted.h5:/' The error was --> : Field clat must have associated a 'full' index in table /df/table (Table(390557601,)) ''. The destination file looks like: C9_sorted.h5 (File) '' Last modif.: 'Fri Jul 26 18:17:56 2013' Object Tree: / (RootGroup) '' /df (Group) '' /df/table (Table(0,), shuffle, blosc(9)) ''

Traceback (most recent call last): File "/usr/local/epd/bin/ptrepack", line 10, in sys.exit(main()) File "/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 480, in main upgradeflavors=upgradeflavors) File "/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 225, in copyChildren raise RuntimeError("Please check that the node names are not " RuntimeError: Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired. In particular, pay attention that rootUEP is not fooling you.

Does this mean, that I can not sort a HDF file by an index column, because they are not 'full' indexes?

like image 315
K.-Michael Aye Avatar asked Jul 27 '13 01:07

K.-Michael Aye


2 Answers

I have tested several of the options Jeff mentions in our chatty discussions above.

Please have a look at this notebook, hopefully it will help you to make relevant decisions for your data storage: https://nbviewer.ipython.org/810bd0720bb1732067ff The gist for the notebook is here: https://gist.github.com/michaelaye/810bd0720bb1732067ff

My main conclusions:

  • Using index=False has several impressive effects:
    1. It reduces the file size of the resulting HDF file.
    2. It creates the HDF file much faster.
    3. Even so ptdump and the storer().group.table print-out does not show any index, the store display still shows indexers and data-columns (That's probably ignorance of the pytables machinery on my side).
  • Creating an index via store.create_table_index() does nothing yet to the speed of data selection via one of the data-columns.
  • This index HAS to be a 'full' index, so that the later ptrepack with --sortby does not bail. But it does NOT have to be a index level 9. The default level 6 is fine, and does not seem to influence the data selection speed significantly. Maybe it would though with many columns?
  • Using --propindexes almost doubles the ptrepacking time with a slight improvement in data selection speed.
  • Using compression and --propindexs is only slightly slower than using --propindex alone, while the data size (at least in this example) does not go down dramatically.
  • The data selection speed does not seem to be hugely different by having used compression.
  • the speedup for this example of 1 mio. lines of 2 columns random data by just using --sortby without --propindexes is approx factor 5 after sorting for the column of selection.

For completion, the ultra-short summary of commands:

df = pd.DataFrame(randn(1e6,2),columns=list('AB')).to_hdf('test.h5','df',
                  data_columns=list('AB'),mode='w',table=True,index=False)
store = pd.HDFStore('test.h5')
store.create_table_index('df',columns=['B'], kind='full')
store.close()

And in the shell:

ptrepack --chunkshape=auto --sortby=B test.h5 test_sorted.h5
like image 126
K.-Michael Aye Avatar answered Nov 17 '22 02:11

K.-Michael Aye


Here is a complete example.

Create the frame with a data_column. Reset the index to a full index. Use ptrepack to sortby it.

In [16]: df = DataFrame(randn(10,2),columns=list('AB')).to_hdf('test.h5','df',data_columns=['B'],mode='w',table=True)

In [17]: store = pd.HDFStore('test.h5')

In [18]: store
Out[18]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index],dc->[B])

In [19]: store.get_storer('df').group.table
Out[19]: 
/df/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "B": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

In [20]: store.create_table_index('df',columns=['B'],optlevel=9,kind='full')

In [21]: store.get_storer('df').group.table
Out[21]: 
/df/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "B": Index(9, full, shuffle, zlib(1)).is_CSI=True}

 In [22]: store.close()

 In [25]: !ptdump -avd test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.0',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['B'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {'index': {}},
    levels := 1,
    nan_rep := b'nan',
    non_index_axes := [(1, ['A', 'B'])],
    pandas_type := b'frame_table',
    pandas_version := b'0.10.1',
    table_type := b'appendable_frame',
    values_cols := ['values_block_0', 'B']]
/df/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "B": Index(9, full, shuffle, zlib(1)).is_csi=True}
  /df/table._v_attrs (AttributeSet), 15 attributes:
   [B_dtype := b'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 10,
    TITLE := '',
    VERSION := '2.6',
    index_kind := b'integer',
    values_block_0_dtype := b'float64',
    values_block_0_kind := ['A']]
  Data dump:
[0] (0, [1.10989047288066], 0.396613633081911)
[1] (1, [0.0981650001268093], -0.9209780702446433)
[2] (2, [-0.2429293157073629], -1.779366453624283)
[3] (3, [0.7305529521507728], 1.243565083939927)
[4] (4, [-0.1480724789512519], 0.5260130757651649)
[5] (5, [1.2560020435792643], 0.5455842491255144)
[6] (6, [1.20129355706986], 0.47930635538027244)
[7] (7, [0.9973598999689721], 0.8602929579025727)
[8] (8, [-0.40070941088441786], 0.7622228032635253)
[9] (9, [0.35865804118145655], 0.29939126149826045)

This is a another way to create a completely sorted index (as opposed to writing it this way)

In [23]: !ptrepack --sortby=B test.h5 test_sorted.h5

In [26]: !ptdump -avd test_sorted.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['B'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {'index': {}},
    levels := 1,
    nan_rep := b'nan',
    non_index_axes := [(1, ['A', 'B'])],
    pandas_type := b'frame_table',
    pandas_version := b'0.10.1',
    table_type := b'appendable_frame',
    values_cols := ['values_block_0', 'B']]
/df/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  /df/table._v_attrs (AttributeSet), 15 attributes:
   [B_dtype := b'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 10,
    TITLE := '',
    VERSION := '2.6',
    index_kind := b'integer',
    values_block_0_dtype := b'float64',
    values_block_0_kind := ['A']]
  Data dump:
[0] (2, [-0.2429293157073629], -1.779366453624283)
[1] (1, [0.0981650001268093], -0.9209780702446433)
[2] (9, [0.35865804118145655], 0.29939126149826045)
[3] (0, [1.10989047288066], 0.396613633081911)
[4] (6, [1.20129355706986], 0.47930635538027244)
[5] (4, [-0.1480724789512519], 0.5260130757651649)
[6] (5, [1.2560020435792643], 0.5455842491255144)
[7] (8, [-0.40070941088441786], 0.7622228032635253)
[8] (7, [0.9973598999689721], 0.8602929579025727)
[9] (3, [0.7305529521507728], 1.243565083939927)
like image 45
Jeff Avatar answered Nov 17 '22 00:11

Jeff