
pyarrow parquet - encoding array into list of records

I am creating Parquet files using pandas and pyarrow, and then reading the schema of those files using Java (org.apache.parquet.avro.AvroParquetReader).

I found out that Parquet files created using pandas + pyarrow always encode arrays of primitive types as an array of records with a single field.

I observed the same behaviour when using PySpark. There is a similar question here: "Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery".

Here is the Python script that creates the Parquet file:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


# Each row of 'customerProducts' holds a list of strings.
df = pd.DataFrame(
    {
        'organizationId': ['org1', 'org2', 'org3'],
        'entityType': ['customer', 'customer', 'customer'],
        'entityId': ['cust_1', 'cust_2', 'cust_3'],
        'customerProducts': [['p1', 'p2'], ['p4', 'p5'], ['p1', 'p3']],
    }
)

# Convert the DataFrame to an Arrow table and write it out as Parquet.
table = pa.Table.from_pandas(df)
pq.write_table(table, 'output.parquet')

When I read the Avro schema of that Parquet file, I see the following schema for the 'customerProducts' field:

{"type":"array","items":{"type":"record","name":"list","fields":[{"name":"item","type":["null","string"],"default":null}]}}

but I would expect something like this:

{"type":"array","items":["null","string"]}

Does anyone know whether there is a way to make sure that Parquet files created with arrays of primitive types have the simplest possible schema?

Thanks

asked Jan 20 '26 09:01 by anthony


1 Answer

As far as I know, the Parquet data model follows the Dremel data model (the one Capacitor also uses), which allows a column to be one of three types:

  1. required
  2. optional
  3. repeated

To represent a list, a nested group is needed: the extra level of indirection makes it possible to distinguish between empty lists and lists containing only null values.

answered Jan 22 '26 21:01 by Micah Kornfield


