
Nested numpy arrays in dask and pandas dataframes

A common use case in machine/deep learning code that works on images and audio is loading and manipulating large datasets of images or audio segments. Almost always, the entries in these datasets consist of an image/audio segment plus metadata (e.g. class label, training/test split, etc.).

For instance, in my specific use case of speech recognition, datasets are almost always composed of entries with properties such as:

  • Speaker ID (string)
  • Transcript (string)
  • Test data (bool)
  • Wav data (numpy array)
  • Dataset name (string)
  • ...

What is the recommended way of representing such a dataset in pandas and/or dask, with emphasis on the wav data (in an image dataset, this would be the image data itself)?

In pandas, with a few tricks, one can nest a numpy array inside a column (sketched below), but this doesn't serialize well and also won't work with dask. This seems like an extremely common use case, yet I can't find any relevant recommendations.
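
For illustration, a minimal sketch of the kind of trick meant above, storing each waveform as a Python object in an object-dtype column (the column names and array contents are made up):

import numpy as np
import pandas as pd

# Each row holds metadata plus a whole waveform stored as a generic Python object
df = pd.DataFrame({
    "speaker_id": ["spk0", "spk1"],
    "transcript": ["hello world", "good morning"],
    "is_test": [False, True],
})
df["wav"] = pd.Series(
    [np.random.randn(16000).astype("float32"),
     np.random.randn(8000).astype("float32")],
    dtype=object,
)
print(df.dtypes)  # 'wav' ends up as the catch-all 'object' dtype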

One can also serialize/deserialize these arrays to binary format (Uber's petastorm does something like this) but this seems to miss the point of libraries such as dask and pandas where automagic serialization is one of the core benefits.

Any practical comments, or suggestions for different methodologies are most welcome.

asked Mar 23 '19 by stav


People also ask

Can NumPy arrays be nested?

Generally, nested NumPy arrays of NumPy arrays are not very useful. If you are using NumPy for speed, it is usually best to stick with NumPy arrays that have a homogeneous, basic numeric dtype.

Can you use NumPy and pandas together?

You don't have to import NumPy to use pandas. NumPy and pandas are two different packages, but both are powerful libraries for working with data efficiently, and they interoperate well, which is why people use them together.

What is the difference between NumPy array and pandas DataFrame?

NumPy is memory efficient. Pandas performs better when the number of rows is 500K or more, while NumPy performs better when the number of rows is 50K or less. Indexing a pandas Series is much slower than indexing a NumPy array.

Is NumPy array faster than pandas DataFrame?

NumPy performs better than pandas for 50K rows or fewer, while pandas performs better for 500K rows or more. Between 50K and 500K rows, performance depends on the type of operation.


2 Answers

The data organisation that you have does indeed sound an awful lot like an xarray: multi-dimensional data, with regular coordinates along each of the dimensions and variable properties. xarray allows you to operate on your array in a pandas-like fashion (the docs are very detailed, so I won't go into it). Of note, xarray interfaces directly with Dask so that, as you operate on the high-level data structure, you are actually manipulating dask arrays underneath and so can compute out-of-core and/or distributed.

Although xarray was inspired by the netCDF hierarchical data representation (typically stored as HDF5 files), there are a number of possible storage options you could use, including zarr, which is particularly useful as a cloud-friendly format for the kind of parallel access Dask likes to use.
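
To make that concrete, here is a minimal sketch (not from the original answer) of what such a dataset could look like, assuming fixed-length waveforms for simplicity; the variable names and the speech_dataset.zarr path are illustrative, and it needs xarray, dask and zarr installed:

import numpy as np
import xarray as xr

# Hypothetical dataset: 100 utterances, each a fixed-length 16 kHz waveform
n_utts, n_samples = 100, 16000
ds = xr.Dataset(
    data_vars={
        "wav": (("utterance", "sample"),
                np.random.randn(n_utts, n_samples).astype("float32")),
    },
    coords={
        "utterance": np.arange(n_utts),
        "speaker_id": ("utterance", np.array([f"spk{i % 10}" for i in range(n_utts)])),
        "is_test": ("utterance", np.arange(n_utts) % 5 == 0),
    },
)

# Chunk the dataset so the wav data becomes a dask array and computes out-of-core
ds = ds.chunk({"utterance": 10})

# Lazy, pandas-like computation over the audio plus its metadata
rms_per_utterance = np.sqrt((ds.wav ** 2).mean(dim="sample"))
print(rms_per_utterance.compute())

# Persist to zarr, a chunked format that dask can read and write in parallel
ds.to_zarr("speech_dataset.zarr", mode="w")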

answered Oct 16 '22 by mdurant


One (perhaps ugly) way is to monkey-patch the pandas and dask parquet APIs to support multi-dimensional arrays:

# These monkey-patches of the pandas and dask I/O APIs let us save multi-dimensional numpy
# arrays in parquet format by serializing them into byte arrays

from dask import dataframe as dd
import numpy as np
import pandas as pd
from io import BytesIO

# Patch pandas' pyarrow reader so byte blobs are loaded back into numpy arrays
def _patched_pd_read_parquet(*args, **kwargs):
    return _orig_pd_read_parquet(*args, **kwargs).applymap(
        lambda val: np.load(BytesIO(val)) if isinstance(val, bytes) else val)
_orig_pd_read_parquet = pd.io.parquet.PyArrowImpl.read
pd.io.parquet.PyArrowImpl.read = _patched_pd_read_parquet

def _serialize_ndarray(arr: np.ndarray) -> bytes:
    if isinstance(arr, np.ndarray):
        with BytesIO() as buf:
            np.save(buf, arr)
            return buf.getvalue()
    return arr

def _deserialize_ndarray(val: bytes) -> np.ndarray:
    return np.load(BytesIO(val)) if isinstance(val, bytes) else val

# Patch pandas' pyarrow writer to serialize array cells into bytes before writing
def _patched_pd_write_parquet(self, df: pd.DataFrame, *args, **kwargs):
    return _orig_pd_write_parquet(self, df.applymap(_serialize_ndarray), *args, **kwargs)
_orig_pd_write_parquet = pd.io.parquet.PyArrowImpl.write
pd.io.parquet.PyArrowImpl.write = _patched_pd_write_parquet

# Patch dask's per-piece pyarrow reader to deserialize array cells after reading
def _patched_dask_read_pyarrow_parquet_piece(*args, **kwargs):
    return _orig_dask_read_pyarrow_parquet_piece(*args, **kwargs).applymap(_deserialize_ndarray)
_orig_dask_read_pyarrow_parquet_piece = dd.io.parquet._read_pyarrow_parquet_piece
dd.io.parquet._read_pyarrow_parquet_piece = _patched_dask_read_pyarrow_parquet_piece

# Patch dask's per-partition pyarrow writer to serialize array cells before writing
def _patched_dd_write_partition_pyarrow(df: pd.DataFrame, *args, **kwargs):
    return _orig_dd_write_partition_pyarrow(df.applymap(_serialize_ndarray), *args, **kwargs)
_orig_dd_write_partition_pyarrow = dd.io.parquet._write_partition_pyarrow
dd.io.parquet._write_partition_pyarrow = _patched_dd_write_partition_pyarrow

You can then use the tricks mentioned in the question to get nested arrays in pandas cells (in memory), and the code above acts as a "poor man's" codec, serializing the arrays into byte streams that storage formats such as parquet can handle.
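
For completeness, a usage sketch (not part of the original answer), assuming the patches above have been applied and pyarrow is the parquet engine; the file name and columns are made up:

import numpy as np
import pandas as pd

# A frame whose 'wav' column holds variable-length numpy arrays (object dtype)
df = pd.DataFrame({
    "speaker_id": ["spk0", "spk1"],
    "transcript": ["hello world", "good morning"],
    "wav": [np.random.randn(16000).astype("float32"),
            np.random.randn(8000).astype("float32")],
})

# With the patches in place, the arrays round-trip through parquet as byte blobs
df.to_parquet("speech.parquet", engine="pyarrow")
restored = pd.read_parquet("speech.parquet", engine="pyarrow")
assert isinstance(restored.loc[0, "wav"], np.ndarray)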

answered Oct 16 '22 by stav