Background
I work with a script that generates single rows of results and appends them to a file on disk in an iterative fashion. To speed things up, I decided to use HDF5 containers rather than .csv. Benchmarking then revealed that strings slow HDF5 down. I was told this can be mitigated by converting the strings to the categorical dtype.
Issue
I have not been able to append categorical rows containing new categories to HDF5. Also, I do not know how to control the dtype of cat.codes, which AFAIK can be done somehow.
1 - Create large dataframe with categorical data
import pandas as pd
import random, string

# 100,000 random 5-letter strings as dummy data
dummy_data = [''.join(random.sample(string.ascii_uppercase, 5))
              for _ in range(100000)]
df_big = pd.DataFrame(dummy_data, columns=['Dummy_Data'])
df_big['Dummy_Data'] = df_big['Dummy_Data'].astype('category')
2 - Create one row to append
df_small = pd.DataFrame(['New_category'], columns=['Dummy_Data'])
df_small['Dummy_Data'] = df_small['Dummy_Data'].astype('category')
3 - Save (1) to HDF and try to append (2)
df_big.to_hdf('h5_file.h5', 'symbols_dict', format='table',
              data_columns=True, append=False,
              complevel=9, complib='blosc')
df_small.to_hdf('h5_file.h5', 'symbols_dict', format='table',
                data_columns=True, append=True,
                complevel=9, complib='blosc')
This results in the following exception:
ValueError: invalid combinate of [values_axes] on appending data [name->Dummy_Data,cname->Dummy_Data,dtype->int8,kind->integer,shape->(1,)] vs current table [name->Dummy_Data,cname->Dummy_Data,dtype->int32,kind->integer,shape->None]
My fixing attempts
I tried to adjust the dtype of cat.codes:
df_big['Dummy_Data'] = df_big['Dummy_Data'].cat.codes.astype('int32')
df_small['Dummy_Data'] = df_small['Dummy_Data'].cat.codes.astype('int32')
When I do this, the error disappears, but so does the categorical dtype:
df_test = pd.read_hdf('h5_file.h5', key='symbols_dict')
df_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100001 entries, 0 to 0 # The appending worked now
Data columns (total 1 columns):
Dummy_Data 100001 non-null int32 # Categorical dtype gone
dtypes: int32(1) # I need to change dtype of cat.codes of categorical
memory usage: 1.1 MB # Not of categorical itself
In addition, df_small.info() does not show the dtype of cat.codes in the first place, which makes it difficult to debug. What am I doing wrong?
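For what it's worth, the dtype of the codes can be checked directly through the .cat accessor rather than via info(); doing so confirms where the int8/int32 mismatch in the exception comes from, since pandas apparently picks the smallest integer type that can hold all category codes:

# inspect the underlying codes directly via the .cat accessor
print(df_big['Dummy_Data'].cat.codes.dtype)    # int32 -- ~100,000 categories
print(df_small['Dummy_Data'].cat.codes.dtype)  # int8  -- only 1 category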
Questions
1. How to properly change the dtype of cat.codes?
2. How to properly append categorical data to HDF5 in Python?
Answer
I am not an expert on this, but as far as I can tell from the h5py module docs (http://docs.h5py.org/en/latest/high/dataset.html), HDF5 supports NumPy datatypes, and these do not include any categorical datatype. The same goes for PyTables, which pandas uses under the hood.
The categorical datatype is introduced by pandas itself, and the pandas docs describe it as follows:
A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R)
So what might be happening is that every time a new category is added, pandas has to somehow re-read all existing categories from the HDF5 store in order to reindex the column.
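If that is the case, one workaround that might be worth trying is to give the new row the full category set of the existing data before appending, so that both columns end up with the same codes dtype. A minimal sketch, untested against the HDF5 append path:

# sketch: extend the small frame's categories with everything the big
# frame already contains, so both columns share one codes dtype (int32)
all_cats = df_big['Dummy_Data'].cat.categories.union(
    df_small['Dummy_Data'].cat.categories)
df_small['Dummy_Data'] = df_small['Dummy_Data'].cat.set_categories(all_cats)
df_small.to_hdf('h5_file.h5', 'symbols_dict', format='table',
                data_columns=True, append=True, complevel=9, complib='blosc')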
From the docs in general, however, it appears that this datatype is not suited for adding arbitrary strings to an HDF5 store, unless you are sure that after maybe a couple of additions no new categories will appear.
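If new strings keep arriving, it may be simpler to skip the categorical dtype altogether and store plain strings, reserving a fixed byte width per value with the min_itemsize option of to_hdf so that later appends with unseen strings still fit. A rough sketch (the width of 50 is an arbitrary assumption):

# sketch: plain strings in an appendable table; min_itemsize reserves
# space so rows appended later may contain arbitrary new strings
df_big['Dummy_Data'] = df_big['Dummy_Data'].astype(str)
df_big.to_hdf('h5_strings.h5', 'symbols_dict', format='table',
              data_columns=True, append=False,
              min_itemsize={'Dummy_Data': 50})
df_small['Dummy_Data'] = df_small['Dummy_Data'].astype(str)
df_small.to_hdf('h5_strings.h5', 'symbols_dict', format='table',
                data_columns=True, append=True)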
As an additional note: unless your application demands extremely high performance, storing the data in SQL might be a better option, since SQL has better support for strings, for one thing. For example, while SQLite came out slower than HDF5 in some tests, those tests did not involve string processing. Jumping from CSV to HDF5 sounds like jumping from a horse cart to a rocket, but perhaps a car or an airplane would work just as well (or better, since it has more options, to stretch the analogy).
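For illustration, a minimal row-by-row append with SQLite and pandas' built-in to_sql could look like this (the database file and table name are made up for the example):

import sqlite3
import pandas as pd

# SQLite stores strings natively, so new values need no category bookkeeping
conn = sqlite3.connect('results.db')
row = pd.DataFrame({'Dummy_Data': ['New_category']})
row.to_sql('results', conn, if_exists='append', index=False)
conn.close()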