What format to export pandas dataframe while retaining data types? Not CSV; Sqlite? Parquet?

Tags: python, pandas, parquet, feather

My workflow typically involves loading data, usually from CSV files, into a pandas DataFrame, cleansing it, defining the right data type for each column, then exporting it to a SQL server.

For those situations when a SQL server is not available, what are good alternatives to store the cleansed data and the explicit definition of the data type for each column?

- The only real solution I have tested is to export to a SQLite .db file, using the answer here to make sure dates are read as dates.
- How about Feather, HDF5, Parquet? Pandas supports them, but I don't know much about these formats. I have read that Feather is not recommended for long-term storage (because the API may change? Not clear).
- I am not sure about using pickle: I understand it's not a secure format, and the API keeps changing and breaking backwards compatibility.
- CSV is not really an option because inferring data types on my data is often a nightmare; when reading the data back into pandas, I'd need to explicitly declare the formats, including the date format, otherwise:
  - pandas can create columns where one row is dd-mm-yyyy and another row is mm-dd-yyyy (see here).
  - I have many text columns where the first 10k rows seem to be numbers and the next 100 are text, so most software will infer the column is numeric, then fail on the import. Maybe I'd need to create a function which exports an ancillary file with all the data type definitions, date formats, etc.? Feasible but cumbersome.
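For reference, the ancillary-file idea from the CSV point could look roughly like the sketch below. It assumes a JSON sidecar holding one dtype string per column; the names `save_with_schema`, `load_with_schema`, and the file names are hypothetical, not part of pandas. Dates are written as ISO 8601 text so nothing has to be guessed on reload. Timezone-aware datetimes and other exotic dtypes are not handled.

```python
import json
import os
import tempfile

import pandas as pd


def save_with_schema(df, csv_path, schema_path):
    """Write df to CSV plus a JSON sidecar recording each column's dtype."""
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    with open(schema_path, "w", encoding="utf-8") as fh:
        json.dump(schema, fh, indent=2)
    # ISO 8601 timestamps avoid the dd-mm vs mm-dd ambiguity on reload.
    df.to_csv(csv_path, index=False, date_format="%Y-%m-%dT%H:%M:%S")


def load_with_schema(csv_path, schema_path):
    """Read the CSV back, applying the dtypes recorded in the sidecar."""
    with open(schema_path, encoding="utf-8") as fh:
        schema = json.load(fh)
    # Datetime columns go through parse_dates; everything else through dtype.
    date_cols = [c for c, t in schema.items() if t.startswith("datetime64")]
    dtypes = {c: t for c, t in schema.items() if c not in date_cols}
    return pd.read_csv(csv_path, dtype=dtypes, parse_dates=date_cols)


# Illustrative data: a text column that looks numeric at first, plus a date.
sample = pd.DataFrame(
    {
        "code": ["001", "002", "abc"],
        "when": pd.to_datetime(["2019-03-25", "2019-12-01", "2020-01-02"]),
        "amount": [1.5, 2.0, 3.25],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "clean.csv")
    schema_path = os.path.join(tmp, "clean.schema.json")
    save_with_schema(sample, csv_path, schema_path)
    reloaded = load_with_schema(csv_path, schema_path)

print(reloaded.dtypes)
```

Because the text column is read with its recorded dtype, values such as "001" keep their leading zeros instead of being inferred as numbers.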

UPDATE: This is an interesting comparison, according to which HDF5 was the fastest format: https://medium.com/@bobhaffner/gist-to-medium-test-db3d51b8ba7b

As far as I understand, another difference between HDF5 and Parquet is that datetime64 has no direct equivalent in HDF5. Most people seem to store their dates in HDF5 as ISO-formatted (yyyy-mm-dd) strings.
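The ISO-string convention described above can be sketched in plain pandas, independently of the storage format. The column name `when` is made up for illustration; the point is only that an explicit format on both sides leaves nothing to inference.

```python
import pandas as pd

df = pd.DataFrame({"when": pd.to_datetime(["2019-03-25", "2019-12-01"])})

# Before export: render dates as ISO 8601 strings (yyyy-mm-dd).
as_text = df.assign(when=df["when"].dt.strftime("%Y-%m-%d"))

# After import: parse with an explicit format so nothing is guessed.
restored = as_text.assign(
    when=pd.to_datetime(as_text["when"], format="%Y-%m-%d")
)
```

This keeps day precision only; a column with times would need a longer format string on both sides.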

958

asked Mar 25 '19 17:03

Pythonista anonymous

1 Answer

If your data is a two-dimensional table intended for big-data processing with tools like Apache Spark, use Parquet. As you mentioned, HDF5 is not good at handling dates and times.
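A minimal Parquet round trip might look like this. It is a sketch that assumes the optional `pyarrow` package is installed; the helper names and sample columns are hypothetical.

```python
import pandas as pd


def make_sample():
    """Small frame with column types that CSV round trips tend to lose."""
    return pd.DataFrame(
        {
            "code": ["001", "002", "x3"],  # text that partly looks numeric
            "when": pd.to_datetime(["2019-03-25", "2019-12-01", "2020-01-02"]),
            "amount": [1.5, 2.0, 3.25],
        }
    )


def parquet_roundtrip(df, path):
    """Write df to a Parquet file and read it straight back."""
    # Assumption: pyarrow is available as the Parquet engine.
    df.to_parquet(path, engine="pyarrow", index=False)
    return pd.read_parquet(path, engine="pyarrow")


# Example call (the path is a placeholder):
# restored = parquet_roundtrip(make_sample(), "clean.parquet")
```

The column types are stored in the file itself, so no dtype or date format has to be declared when reading.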

If your data has three or more dimensions, HDF5 is a good choice, especially for long-term archiving, portability, and sharing.

Feather (the Apache Arrow file format) is the fastest if performance matters.
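Feather is used the same way from pandas. The sketch below again assumes `pyarrow` is installed, and the helper name `feather_roundtrip` is hypothetical. It also shows reading back only some columns, which can help when the file is wide.

```python
import pandas as pd


def feather_roundtrip(df, path, columns=None):
    """Write df to a Feather file and read back all or selected columns."""
    # Assumption: pyarrow is installed; pandas delegates Feather I/O to it.
    df.to_feather(path)
    return pd.read_feather(path, columns=columns)


# Example call (the path and column names are placeholders):
# dates_only = feather_roundtrip(df, "clean.feather", columns=["when"])
```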

184

answered Sep 20 '22 19:09

HDFEOS.org
