 

OverflowError while saving large Pandas df to hdf

I have a large Pandas dataframe (~15GB, 83m rows) that I am interested in saving as an h5 (or feather) file. One column contains long ID strings of numbers, which should have string/object type. But even when I ensure that pandas parses all columns as object:

df = pd.read_csv('data.csv', dtype=object)
print(df.dtypes)  # sanity check
df.to_hdf('df.h5', 'df')

> client_id                object
  event_id                 object
  account_id               object
  session_id               object
  event_timestamp          object
  # etc...

I get this error:

  File "foo.py", line 14, in <module>
    df.to_hdf('df.h5', 'df')
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/generic.py", line 1996, in to_hdf
    return pytables.to_hdf(path_or_buf, key, self, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 279, in to_hdf
    f(store)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 273, in <lambda>
    f = lambda store: store.put(key, value, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 890, in put
    self._write_to_group(key, value, append=append, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1367, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2963, in write
    self.write_array('block%d_values' % i, blk.values, items=blk_items)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2730, in write_array
    vlarr.append(value)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/tables/vlarray.py", line 547, in append
    self._append(nparr, nobjects)
  File "tables/hdf5extension.pyx", line 2032, in tables.hdf5extension.VLArray._append
OverflowError: value too large to convert to int

Apparently it is trying to convert this to an int anyway, and failing.

When running df.to_feather() I get a similar error:

df.to_feather('df.feather')
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/frame.py", line 1892, in to_feather
    to_feather(self, fname)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/feather_format.py", line 83, in to_feather
    feather.write_dataframe(df, path)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/feather.py", line 182, in write_feather
    writer.write(df)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/feather.py", line 93, in write
    table = Table.from_pandas(df, preserve_index=False)
  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 501, in dataframe_to_arrays
    convert_fields))
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 487, in convert_column
    raise e
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 481, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert 1542852887489 with type str: tried to convert to double', 'Conversion failed for column session_id with type object')

So:

  1. Is anything that looks like a number forcibly converted to a number in storage?
  2. Could the presence of NaNs affect what's happening here?
  3. Is there an alternative storage solution? What would be the best?


2 Answers

Having done some reading on this topic, it seems like the issue is how these libraries deal with string-type columns. My string columns contain a mixture of all-number strings and strings with characters. Pandas has the flexible option of keeping strings as object, without a declared type, but when serializing to hdf5 or feather the contents of the column are converted to a single type (str or double, say) and cannot be mixed. Both of these libraries fail when confronted with a sufficiently large column of mixed types.

Force-converting my mixed column to strings allowed me to save it in feather, but in HDF5 the file ballooned and the process ended when I ran out of disk space.

Here is an answer in a comparable case where a commenter notes (2 years ago) "This problem is very standard, but solutions are few".

Some background:

String types in Pandas are called object, but this obscures that they may be either pure strings or mixed dtypes (numpy has built-in string types, but Pandas never uses them for text). So the first thing to do in a case like this is to enforce string dtype on all string columns (with df[col].astype(str)). But even so, in a large enough file (16GB, with long strings) this still failed. Why?
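Here is a minimal sketch of that step, assuming the df from the question; the helper name is mine, and the mask is there because a plain astype(str) would turn NaN into the literal string 'nan':

def stringify_object_columns(df):
    # Cast every object-dtype column to real Python strings.
    for col in df.select_dtypes(include='object').columns:
        mask = df[col].notna()
        # Only convert non-missing cells, so NaN does not become the string 'nan'.
        df.loc[mask, col] = df.loc[mask, col].astype(str)
    return df

df = stringify_object_columns(df)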

The reason I was encountering this error was that I had long, high-entropy (many different unique values) strings. (With low-entropy data, it might have been worthwhile to switch to a categorical dtype.) In my case, I realised that I only needed these strings in order to identify rows - so I could replace them with unique integers!

# Map each unique string in the column to a distinct integer ID
df[col] = df[col].map(dict(zip(df[col].unique(), range(df[col].nunique()))))
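(If all you need is a stable integer ID per unique string, pandas' built-in pd.factorize does the same mapping without building the intermediate dict - just a sketch of the equivalent:)

# codes run 0..n_unique-1 in order of first appearance; NaN becomes -1
df[col], uniques = pd.factorize(df[col])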

Other Solutions:

For text data, there are other recommended solutions besides hdf5/feather, including the following (a short parquet/pickle sketch follows the list):

  • json
  • msgpack (note that in Pandas 0.25 read_msgpack is deprecated)
  • pickle (which has known security issues, so be careful - but it should be OK for internal storage/transfer of dataframes)
  • parquet, part of the Apache Arrow ecosystem.
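Of these, parquet was the most natural drop-in for a dataframe like this. A minimal sketch, assuming pandas is imported as pd, pyarrow is installed, and the string columns have already been forced to str as above (file names are just examples):

# Parquet round trip (using the pyarrow engine)
df.to_parquet('df.parquet', engine='pyarrow')
df2 = pd.read_parquet('df.parquet')

# Pickle also round-trips arbitrary object columns - only for data you trust
df.to_pickle('df.pkl')
df3 = pd.read_pickle('df.pkl')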

Here is an answer from Matthew Rocklin (one of the dask developers) comparing msgpack and pickle. He wrote a broader comparison on his blog.



HDF5 is not a suitable solution for this use case. HDF5 is a better fit if you have many dataframes that you want to store in a single structure. It has more overhead when opening the file, but it then allows you to efficiently load each dataframe and also to easily load slices of them. It should be thought of as a file system that stores dataframes.
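For completeness, a sketch of that "file system of dataframes" pattern (events_df and accounts_df are placeholder frames; format='table' is what allows querying slices later):

import pandas as pd

# Store several dataframes in one HDF5 file...
with pd.HDFStore('store.h5') as store:
    store.put('events', events_df, format='table', data_columns=['account_id'])
    store.put('accounts', accounts_df, format='table')

# ...and later load only a slice of one of them.
with pd.HDFStore('store.h5') as store:
    subset = store.select('events', where="account_id == '12345'")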

In the case of a single dataframe of time series events, the recommended formats would be one of the Apache Arrow project formats, i.e. feather or parquet. One should think of those as column-based (compressed) csv files. The particular trade-off between the two is laid out nicely under What are the differences between feather and parquet?.
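A quick sketch of the single-dataframe usage (the file name is just a placeholder; pyarrow needs to be installed, and feather does not store the index, hence the reset):

df.reset_index(drop=True).to_feather('events.feather')
df2 = pd.read_feather('events.feather')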

One particular issue to consider is data types. Since feather is not designed to optimize disk space through compression, it can support a larger variety of data types. Parquet, on the other hand, aims for very efficient compression and so supports only a limited subset of types, which lets it compress the data better.
