
PyArrow: Store list of dicts in parquet using nested types

I want to store the following pandas data frame in a parquet file using PyArrow:

import pandas as pd
df = pd.DataFrame({'field': [[{}, {}]]})

The type of the field column is list of dicts:

      field
0  [{}, {}]

I first define the corresponding PyArrow schema:

import pyarrow as pa
schema = pa.schema([pa.field('field', pa.list_(pa.struct([])))])

Then I use from_pandas():

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

This throws the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "table.pxi", line 930, in pyarrow.lib.Table.from_pandas
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 371, in dataframe_to_arrays
    convert_types)]
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 370, in <listcomp>
    for c, t in zip(columns_to_convert,
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 366, in convert_column
    return pa.array(col, from_pandas=True, type=ty)
  File "array.pxi", line 177, in pyarrow.lib.array
  File "error.pxi", line 77, in pyarrow.lib.check_status
  File "error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unknown list item type: struct<>

Am I doing something wrong or is this not supported by PyArrow?

I am using pyarrow 0.9.0, pandas 0.23.4, and Python 3.6.

asked Feb 21 '19 by SergiyKolesnikov



1 Answer

According to this Jira issue, reading and writing nested Parquet data with a mix of struct and list nesting levels was implemented in PyArrow version 2.0.0.

The following example demonstrates the implemented functionality by doing a round trip: pandas data frame -> Parquet file -> pandas data frame. The PyArrow version used is 3.0.0.
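Since the fix only landed in 2.0.0, it can help to check the installed version before relying on nested-type support. A minimal sketch (the helper name `supports_nested_parquet` is made up for illustration):

```python
def supports_nested_parquet(version: str) -> bool:
    # Nested struct/list Parquet round trips were implemented in pyarrow 2.0.0,
    # so anything with a major version of 2 or higher should work.
    major = int(version.split('.')[0])
    return major >= 2

print(supports_nested_parquet('0.9.0'))  # the version from the question -> False
print(supports_nested_parquet('3.0.0'))  # the version used below -> True
```

In practice you would pass `pyarrow.__version__` to the helper.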

The initial pandas data frame has one field of type list of dicts and a single row:

                  field
0  [{'a': 1}, {'a': 2}]

Example code:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet

df = pd.DataFrame({'field': [[{'a': 1}, {'a': 2}]]})
schema = pa.schema(
    [pa.field('field', pa.list_(pa.struct([('a', pa.int64())])))])
table_write = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pyarrow.parquet.write_table(table_write, 'test.parquet')
table_read = pyarrow.parquet.read_table('test.parquet')
table_read.to_pandas()

The output data frame is the same as the input data frame, as it should be:

                  field
0  [{'a': 1}, {'a': 2}]
answered Oct 12 '22 by SergiyKolesnikov