
Nested numpy arrays in dask and pandas dataframes

A common use case in machine/deep learning code that works on images and audio is loading and manipulating large datasets of images or audio segments. Almost always, the entries in these datasets consist of an image/audio segment plus metadata (e.g. class label, training/test split, etc.).

For instance, in my specific use case of speech recognition, datasets are almost always composed of entries with properties such as:

  • Speaker ID (string)
  • Transcript (string)
  • Test data (bool)
  • Wav data (numpy array)
  • Dataset name (string)
  • ...

What is the recommended way of representing such a dataset in pandas and/or dask, with emphasis on the wav data (in an image dataset, this would be the image data itself)?

In pandas, with a few tricks, one can nest a numpy array inside a column (sketched below), but this doesn't serialize well and also won't work with dask. This seems like an extremely common use case, yet I can't find any relevant recommendations.
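
For illustration, a minimal sketch of the kind of trick meant above, storing each waveform as a Python object in an object-dtype column (the column names and array contents are made up):

import numpy as np
import pandas as pd

# Each row holds metadata plus a whole waveform stored as a generic Python object
df = pd.DataFrame({
    "speaker_id": ["spk0", "spk1"],
    "transcript": ["hello world", "good morning"],
    "is_test": [False, True],
})
df["wav"] = pd.Series(
    [np.random.randn(16000).astype("float32"),
     np.random.randn(8000).astype("float32")],
    dtype=object,
)
print(df.dtypes)  # 'wav' ends up as the catch-all 'object' dtype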

One can also serialize/deserialize these arrays to binary format (Uber's petastorm does something like this) but this seems to miss the point of libraries such as dask and pandas where automagic serialization is one of the core benefits.

Any practical comments, or suggestions for different methodologies are most welcome.

asked Mar 23 '19 by stav


People also ask

Can NumPy arrays be nested?

Generally, nested NumPy arrays of NumPy arrays are not very useful. If you are using NumPy for speed, it is usually best to stick with NumPy arrays that have a homogeneous, basic numeric dtype.

Can you use NumPy and pandas together?

You don't have to import NumPy to use pandas. NumPy and pandas are two different packages, but both are powerful libraries for working with data efficiently, and they interoperate well, which is why people use them together.

What is the difference between NumPy array and pandas DataFrame?

NumPy is memory efficient. Pandas performs better when the number of rows is 500K or more, while NumPy performs better when the number of rows is 50K or less. Indexing a pandas Series is much slower than indexing a NumPy array.

Is NumPy array faster than pandas DataFrame?

NumPy performs better than pandas for 50K rows or fewer, while pandas performs better for 500K rows or more. Between 50K and 500K rows, performance depends on the type of operation.


2 Answers

The data organisation that you have does indeed sound an awful lot like an xarray: multi-dimensional data, with regular coordinates along each of the dimensions and variable properties. xarray allows you to operate on your array in a pandas-like fashion (the docs are very detailed, so I won't go into it). Of note, xarray interfaces directly with Dask so that, as you operate on the high-level data structure, you are actually manipulating dask arrays underneath and so can compute out-of-core and/or distributed.

Although xarray was inspired by the netCDF hierarchical data representation (typically stored as HDF5 files), there are a number of possible storage options you could use, including zarr, which is particularly useful as a cloud-friendly format for the kind of parallel access Dask likes to use.
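
To make that concrete, here is a minimal sketch (not from the original answer) of what such a dataset could look like, assuming fixed-length waveforms for simplicity; the variable names and the speech_dataset.zarr path are illustrative, and it needs xarray, dask and zarr installed:

import numpy as np
import xarray as xr

# Hypothetical dataset: 100 utterances, each a fixed-length 16 kHz waveform
n_utts, n_samples = 100, 16000
ds = xr.Dataset(
    data_vars={
        "wav": (("utterance", "sample"),
                np.random.randn(n_utts, n_samples).astype("float32")),
    },
    coords={
        "utterance": np.arange(n_utts),
        "speaker_id": ("utterance", np.array([f"spk{i % 10}" for i in range(n_utts)])),
        "is_test": ("utterance", np.arange(n_utts) % 5 == 0),
    },
)

# Chunk the dataset so the wav data becomes a dask array and computes out-of-core
ds = ds.chunk({"utterance": 10})

# Lazy, pandas-like computation over the audio plus its metadata
rms_per_utterance = np.sqrt((ds.wav ** 2).mean(dim="sample"))
print(rms_per_utterance.compute())

# Persist to zarr, a chunked format that dask can read and write in parallel
ds.to_zarr("speech_dataset.zarr", mode="w")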

answered Oct 16 '22 by mdurant


One (perhaps ugly) way is to monkey-patch the pandas and dask parquet APIs to support multi-dimensional arrays:

# These monkey-patches of the pandas and dask I/O APIs let us save multi-dimensional numpy
# arrays in parquet format by serializing them into byte arrays

from dask import dataframe as dd
import numpy as np
import pandas as pd
from io import BytesIO

# Patch pandas' pyarrow reader so byte blobs are loaded back into numpy arrays
def _patched_pd_read_parquet(*args, **kwargs):
    return _orig_pd_read_parquet(*args, **kwargs).applymap(
        lambda val: np.load(BytesIO(val)) if isinstance(val, bytes) else val)
_orig_pd_read_parquet = pd.io.parquet.PyArrowImpl.read
pd.io.parquet.PyArrowImpl.read = _patched_pd_read_parquet

def _serialize_ndarray(arr: np.ndarray) -> bytes:
    if isinstance(arr, np.ndarray):
        with BytesIO() as buf:
            np.save(buf, arr)
            return buf.getvalue()
    return arr

def _deserialize_ndarray(val: bytes) -> np.ndarray:
    return np.load(BytesIO(val)) if isinstance(val, bytes) else val

# Patch pandas' pyarrow writer to serialize array cells into bytes before writing
def _patched_pd_write_parquet(self, df: pd.DataFrame, *args, **kwargs):
    return _orig_pd_write_parquet(self, df.applymap(_serialize_ndarray), *args, **kwargs)
_orig_pd_write_parquet = pd.io.parquet.PyArrowImpl.write
pd.io.parquet.PyArrowImpl.write = _patched_pd_write_parquet

# Patch dask's per-piece pyarrow reader to deserialize array cells after reading
def _patched_dask_read_pyarrow_parquet_piece(*args, **kwargs):
    return _orig_dask_read_pyarrow_parquet_piece(*args, **kwargs).applymap(_deserialize_ndarray)
_orig_dask_read_pyarrow_parquet_piece = dd.io.parquet._read_pyarrow_parquet_piece
dd.io.parquet._read_pyarrow_parquet_piece = _patched_dask_read_pyarrow_parquet_piece

# Patch dask's per-partition pyarrow writer to serialize array cells before writing
def _patched_dd_write_partition_pyarrow(df: pd.DataFrame, *args, **kwargs):
    return _orig_dd_write_partition_pyarrow(df.applymap(_serialize_ndarray), *args, **kwargs)
_orig_dd_write_partition_pyarrow = dd.io.parquet._write_partition_pyarrow
dd.io.parquet._write_partition_pyarrow = _patched_dd_write_partition_pyarrow

You can then use the tricks mentioned in the question to get nested arrays in pandas cells (in memory), and the code above acts as a "poor man's" codec, serializing the arrays into byte streams that storage formats such as parquet can handle.
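
For completeness, a usage sketch (not part of the original answer), assuming the patches above have been applied and pyarrow is the parquet engine; the file name and columns are made up:

import numpy as np
import pandas as pd

# A frame whose 'wav' column holds variable-length numpy arrays (object dtype)
df = pd.DataFrame({
    "speaker_id": ["spk0", "spk1"],
    "transcript": ["hello world", "good morning"],
    "wav": [np.random.randn(16000).astype("float32"),
            np.random.randn(8000).astype("float32")],
})

# With the patches in place, the arrays round-trip through parquet as byte blobs
df.to_parquet("speech.parquet", engine="pyarrow")
restored = pd.read_parquet("speech.parquet", engine="pyarrow")
assert isinstance(restored.loc[0, "wav"], np.ndarray)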

answered Oct 16 '22 by stav