I am dealing with Python objects containing Pandas DataFrame and Series objects. These can be large, several million rows each. E.g.:
@dataclass
class MyWorld:
    # A lot of DataFrames with millions of rows
    samples: pd.DataFrame
    addresses: pd.DataFrame
    # etc.
I need to cache these objects, and I am hoping to find an efficient and painless way to serialise them instead of the standard pickle.dump(). Are there any specialised Python serialisers for such objects that would pickle the Series data with an efficient codec and compression automatically? Alternatively, I could hand-construct several Parquet files, but that requires a lot more manual code, and I'd rather avoid it if possible.
Performance here may mean both serialisation speed and the size of the serialised output.
I am aware of joblib.dump(), which does some magic for these kinds of objects, but based on the documentation I am not sure whether it is still relevant.
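For reference, the joblib usage I have in mind is roughly the following, where world stands for a MyWorld instance and the codec and level are picked arbitrarily:

import joblib

# joblib special-cases large Numpy buffers inside the pickle stream and can
# compress them transparently; the codec and level here are illustrative only.
joblib.dump(world, "myworld.joblib", compress=("zlib", 3))
world = joblib.load("myworld.joblib")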
What about storing the huge structures in Parquet format while pickling? This can be automated easily:
import io
import pickle
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass
class MyWorld:
    array: np.ndarray
    series: pd.Series
    frame: pd.DataFrame


@dataclass
class MyWorldParquet:
    array: np.ndarray
    series: pd.Series
    frame: pd.DataFrame

    def __getstate__(self):
        # Serialise each field to a Parquet payload held in a BytesIO buffer.
        # Building a fresh state dict avoids mutating the live object while pickling.
        state = {}
        for key, kind in self.__annotations__.items():
            frame = self.__dict__[key]
            if kind is np.ndarray:
                frame = pd.DataFrame({"_": frame})  # 1-D arrays only in this sketch
            if kind is pd.Series:
                frame = frame.to_frame()
            stream = io.BytesIO()
            frame.to_parquet(stream)  # requires pyarrow or fastparquet
            state[key] = stream
        return state

    def __setstate__(self, data):
        # Rebuild each field from its Parquet payload.
        for key, kind in self.__annotations__.items():
            frame = pd.read_parquet(data[key])
            if kind is np.ndarray:
                frame = frame["_"].values
            if kind is pd.Series:
                frame = frame[frame.columns[0]]
            self.__dict__[key] = frame
Of course there is a trade-off between speed and size, since shrinking the payload requires format translation and compression.
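If size matters more than speed, the codec passed to to_parquet() can also be tightened; here is a minimal standalone sketch (assuming a pyarrow build with zstd support, which most wheels have):

import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.uniform(size=1_000)})

# Same call as in __getstate__, but with an explicit codec: "zstd" usually
# trades a bit of CPU time for a smaller payload than the default "snappy".
buf = io.BytesIO()
df.to_parquet(buf, compression="zstd")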
Let's create a toy dataset:
N = 5_000_000

data = {
    "array": np.random.normal(size=N),
    "series": pd.Series(np.random.uniform(size=N), name="w"),
    "frame": pd.DataFrame({
        "c": np.random.choice(["label-1", "label-2", "label-3"], size=N),
        "x": np.random.uniform(size=N),
        "y": np.random.normal(size=N),
    }),
}
We can compare the Parquet conversion overhead (about 300 ms more):
%timeit -r 10 -n 1 pickle.dumps(MyWorld(**data))
# 1.57 s ± 162 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
%timeit -r 10 -n 1 pickle.dumps(MyWorldParquet(**data))
# 1.9 s ± 71.3 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
And the size gain (about 40 MiB saved):
len(pickle.dumps(MyWorld(**data))) / 2 ** 20
# 200.28876972198486
len(pickle.dumps(MyWorldParquet(**data))) / 2 ** 20
# 159.13739013671875
Of course, these figures will strongly depend on the actual dataset being serialised.
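As a quick sanity check (not part of the benchmark above), one can verify that the data survives the Parquet detour; exact dtype round-tripping depends on the Parquet engine, so the frame comparison is the assertion most likely to need adjustment:

restored = pickle.loads(pickle.dumps(MyWorldParquet(**data)))

assert isinstance(restored.array, np.ndarray)
assert np.allclose(restored.array, data["array"])
pd.testing.assert_series_equal(restored.series, data["series"])
pd.testing.assert_frame_equal(restored.frame, data["frame"])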
This could help you. The article below describes various approaches, one of which is serialising the DataFrame:
Medium article with serialization benchmarks.
It should give you an idea of how to approach the problem.
Also, you can try converting the DataFrame to JSON using df.to_json() and then loading it back as a DataFrame:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html
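For reference, a minimal JSON round trip might look like the sketch below; orient="split" is just one illustrative choice, and JSON output is typically much bulkier than Parquet or pickle for numeric data:

import io

import pandas as pd

df = pd.DataFrame({"c": ["a", "b"], "x": [1.0, 2.0]})  # stand-in DataFrame

payload = df.to_json(orient="split")  # serialise to a JSON string
df_restored = pd.read_json(io.StringIO(payload), orient="split")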