My workflow typically involves loading data, usually from CSV files, into a pandas DataFrame, cleansing it, defining the right data type for each column, then exporting it to a SQL server.
For those situations when a SQL server is not available, what are good alternatives to store the cleansed data and the explicit definition of the data type for each column?
How about Feather, HDF5, or Parquet? Pandas supports them, but I don't know much about these formats. I have read that Feather is not recommended for long-term storage (because the API may change? Not clear).
I am not sure about using pickle: I understand it's not a secure format, and the API keeps changing and breaking backwards compatibility.
CSV is not really an option because inferring data types on my data is often a nightmare; when reading the data back into pandas, I'd need to explicitly declare the formats, including the date format, otherwise:
UPDATE: This is an interesting comparison, according to which HDF5 was the fastest format: https://medium.com/@bobhaffner/gist-to-medium-test-db3d51b8ba7b
As I understand it, another difference between HDF5 and Parquet is that datetime64 has no direct equivalent in HDF5. Most people seem to store their dates in HDF5 as ISO-formatted (yyyy-mm-dd) strings.
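The ISO-string workaround mentioned above can be sketched in pandas as follows. This is a minimal illustration, not a feature of any HDF5 library: the helper names `dates_to_iso` and `iso_to_dates` and the sample columns are hypothetical, and the sketch assumes the columns hold plain dates (time of day and time zones would be lost).

```python
import pandas as pd


def dates_to_iso(df):
    """Return a copy with datetime columns rendered as yyyy-mm-dd strings,
    plus the list of columns that were converted."""
    out = df.copy()
    date_cols = [c for c in out.columns
                 if pd.api.types.is_datetime64_any_dtype(out[c])]
    for col in date_cols:
        # Assumption: date-only values; strftime drops any time component.
        out[col] = out[col].dt.strftime("%Y-%m-%d")
    return out, date_cols


def iso_to_dates(df, date_cols):
    """Parse the listed yyyy-mm-dd string columns back into datetimes."""
    out = df.copy()
    for col in date_cols:
        out[col] = pd.to_datetime(out[col], format="%Y-%m-%d")
    return out


# Hypothetical sample data
original = pd.DataFrame({
    "id": [1, 2],
    "when": pd.to_datetime(["2021-01-31", "2021-02-01"]),
})

as_text, converted = dates_to_iso(original)    # safe to store as strings
restored = iso_to_dates(as_text, converted)    # back to a datetime column
```

The list of converted columns has to be kept somewhere alongside the data, which is exactly the kind of bookkeeping a format with native datetime support avoids.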
HDF5 stores data in a binary format that is native to the computing platform yet portable across platforms. Being binary makes it more efficient for computers than text formats (e.g., .txt or .csv), which are meant for humans to read.
With its column-oriented design, Parquet brings many efficient storage characteristics (e.g., blocks, row group, column chunks) into the fold. Additionally, it is built to support very efficient compression and encoding schemes for realizing space-saving data pipelines.
If your data is a 2-dimensional table and is meant for big-data processing with tools like Apache Spark, use Parquet. HDF5 is not good at handling date/time, as you mentioned.
If your data has 3 or more dimensions, HDF5 will be a good choice - especially for long-term archiving, portability, and sharing.
Feather (part of Apache Arrow) is the fastest if performance matters.