I want to store the following pandas data frame in a parquet file using PyArrow:
import pandas as pd
df = pd.DataFrame({'field': [[{}, {}]]})
The type of the field column is list of dicts:
field
0 [{}, {}]
I first define the corresponding PyArrow schema:
import pyarrow as pa
schema = pa.schema([pa.field('field', pa.list_(pa.struct([])))])
Then I use from_pandas():
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
This throws the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "table.pxi", line 930, in pyarrow.lib.Table.from_pandas
File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 371, in dataframe_to_arrays
convert_types)]
File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 370, in <listcomp>
for c, t in zip(columns_to_convert,
File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 366, in convert_column
return pa.array(col, from_pandas=True, type=ty)
File "array.pxi", line 177, in pyarrow.lib.array
File "error.pxi", line 77, in pyarrow.lib.check_status
File "error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unknown list item type: struct<>
Am I doing something wrong or is this not supported by PyArrow?
I am using pyarrow 0.9.0, pandas 0.23.4, and Python 3.6.
According to this Jira issue, reading and writing nested Parquet data with a mix of struct and list nesting levels was implemented in version 2.0.0.
The following example demonstrates the implemented functionality by doing a round trip: pandas data frame -> parquet file -> pandas data frame. The PyArrow version used is 3.0.0.
The initial pandas data frame has one field of type list of dicts and one entry:
field
0 [{'a': 1}, {'a': 2}]
Example code:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet
df = pd.DataFrame({'field': [[{'a': 1}, {'a': 2}]]})
schema = pa.schema(
[pa.field('field', pa.list_(pa.struct([('a', pa.int64())])))])
table_write = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pyarrow.parquet.write_table(table_write, 'test.parquet')
table_read = pyarrow.parquet.read_table('test.parquet')
table_read.to_pandas()
The output data frame is the same as the input data frame, as it should be:
field
0 [{'a': 1}, {'a': 2}]