Is there a fast way to do serialization of a DataFrame?
I have a grid system which can run pandas analysis in parallel. In the end, I want to collect all the results (as a DataFrame) from each grid job and aggregate them into a giant DataFrame.
How can I save a DataFrame in a binary format that can be loaded rapidly?
Several serialization options exist, each with different strengths. Good serialization support for numeric data, combined with pandas categorical dtypes, enables efficient serialization and storage of DataFrames.
To convert a DataFrame to a JSON string, use the built-in DataFrame.to_json() method. Passing a file path to to_json() writes the JSON to that file instead of returning a string. Note that JSON is a text format, not a binary one.
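As a minimal sketch of the JSON route, the snippet below round-trips a small DataFrame through a string and through a file. The column names and the file name `frame.json` are made up for illustration.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical example data; any DataFrame works the same way.
df = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})

# With no path argument, to_json() returns the JSON as a string.
json_text = df.to_json()
# Wrap the string in StringIO so read_json treats it as a file-like object.
restored = pd.read_json(io.StringIO(json_text))

# With a path argument, to_json() writes to that file instead.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "frame.json"
    df.to_json(path)
    from_file = pd.read_json(path)
```

JSON is convenient for interchange, but because it is text it is usually larger and slower to load than a binary format.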
The easiest way is to pickle the DataFrame with to_pickle (see the pickling section of the API docs):
df.to_pickle(file_name)
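For the grid scenario in the question, a sketch might look like the following: each job pickles its result, and a collector reads the files back with pd.read_pickle and combines them with pd.concat. The helper names `save_result` and `collect_results`, the file naming scheme, and the sample data are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_result(df, path):
    """Hypothetical helper: write one job's result as a pickle file."""
    df.to_pickle(path)


def collect_results(paths):
    """Hypothetical helper: load every pickle and stack them into one DataFrame."""
    frames = [pd.read_pickle(p) for p in paths]
    return pd.concat(frames, ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    # Simulate three grid jobs, each producing a small result frame.
    for job_id in range(3):
        part = pd.DataFrame(
            {"job": [job_id, job_id], "value": [job_id * 10.0, job_id * 10.0 + 1]}
        )
        path = Path(tmp) / f"result_{job_id}.pkl"
        save_result(part, path)
        paths.append(path)

    combined = collect_results(paths)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.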
Another option is HDF5, which takes slightly more work to get started but offers much richer querying.
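A rough sketch of the HDF5 route, assuming the optional PyTables package (`tables`) is installed, since pandas relies on it for HDF5 support. The helper names, the key `"results"`, and the column `score` are hypothetical.

```python
import pandas as pd


def save_hdf(df, path, key="results"):
    """Hypothetical helper: store a DataFrame in an HDF5 file.

    format="table" makes the stored data queryable, and data_columns=True
    allows filtering on any column rather than only on the index.
    """
    df.to_hdf(path, key=key, mode="w", format="table", data_columns=True)


def load_subset(path, min_score, key="results"):
    """Hypothetical helper: read back only rows where score > min_score."""
    return pd.read_hdf(path, key=key, where=f"score > {min_score}")


# Example usage (requires PyTables):
#   save_hdf(df, "results.h5")
#   high = load_subset("results.h5", 4.0)
```

The where filter is applied while reading, so only the matching rows are loaded into memory, which is the main advantage over pickle for large results.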