I have lengthy computations which I repeat many times, so I would like to use memoization (packages such as jug and joblib) in concert with pandas. The question is whether such a package can properly memoize pandas DataFrames passed as function arguments.
Has anyone tried it? Is there any other recommended package/way to do this?
Definition of memoization: memoization is a software optimization technique used to speed up programs. It lets you optimize a Python function by caching its output based on the supplied input parameters, so that the function runs only once for a given input.
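As a minimal sketch of how this can look with joblib (the question mentions joblib, but this snippet is illustrative; the cache directory name and the expensive_summary function are assumptions), joblib's Memory can cache a function that takes a DataFrame argument by hashing it and persisting the result on disk:

from joblib import Memory
import numpy as np
import pandas as pd

# Results are persisted on disk; "./joblib_cache" is an arbitrary directory name.
memory = Memory(location="./joblib_cache", verbose=0)

@memory.cache
def expensive_summary(df):
    # Stand-in for a lengthy computation on the DataFrame.
    return df.describe()

df = pd.DataFrame(np.random.rand(1000, 4), columns=list("abcd"))
first = expensive_summary(df)   # computed and written to the cache
second = expensive_summary(df)  # loaded from the cache, not recomputed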
pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some pandas operations need to make intermediate copies.
Cython (writing C extensions for pandas): for many use cases, writing pandas code in pure Python and NumPy is sufficient. In some computationally heavy applications, however, sizable speed-ups can be achieved by offloading work to Cython, as in the sketch below.
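A rough, hedged sketch of that approach (not from the original answer): in a Jupyter notebook with Cython installed, load the extension with %load_ext Cython and put the following in a cell that starts with the %%cython magic; the function name csum and the use of a float64 column are illustrative assumptions.

%%cython
def csum(double[:] values):
    # Typed memoryview loop compiled to C, avoiding per-element Python overhead.
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in range(values.shape[0]):
        total += values[i]
    return total

It can then be applied to a column, e.g. csum(df["a"].to_numpy()) for a float64 column "a".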
Author of jug here: jug works fine. I just tried the following and it works:
from jug import TaskGenerator
import pandas as pd
import numpy as np
@TaskGenerator
def gendata():
    return pd.DataFrame(np.arange(343440).reshape((10,-1)))

@TaskGenerator
def compute(x):
    return x.mean()

y = compute(gendata())
It is not as efficient as it could be, as it just uses pickle internally for the DataFrame (although it compresses it on the fly, so it is not horrible in terms of memory use, just slower than it could be).
I would be open to a change that saves these as a special case, as jug currently does for numpy arrays: https://github.com/luispedro/jug/blob/master/jug/backends/file_store.py#L102
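A hedged illustration (not jug's backend code) of the kind of special-casing suggested here: storing a DataFrame in a columnar on-disk format instead of pickling it. Assumes pyarrow is installed; the file names are arbitrary.

import os
import pickle

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 10), columns=[f"c{i}" for i in range(10)])

# Generic path: pickle the whole object, as described above.
with open("df.pkl", "wb") as fh:
    pickle.dump(df, fh)

# Special-cased path: a columnar format (requires pyarrow).
df.to_feather("df.feather")

print(os.path.getsize("df.pkl"), os.path.getsize("df.feather"))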