
How to write Parquet metadata with pyarrow?

I use pyarrow to create and analyse Parquet tables with biological information and I need to store some metadata, e.g. which sample the data comes from, how it was obtained and processed.

Parquet seems to support file-wide metadata, but I cannot find how to write it via pyarrow. The closest thing I could find is how to write row-group metadata, but this seems like overkill, since my metadata is the same for all row groups in the file.

Is there any way to write file-wide Parquet metadata with pyarrow?

golobor asked Aug 31 '18 21:08



2 Answers

Pyarrow maps the file-wide metadata to a field in the table's schema named metadata. Regrettably there is not (yet) documentation on this.

Both the Parquet metadata format and the Pyarrow metadata format represent metadata as a collection of key/value pairs where both key & value must be strings. This is unfortunate as it would be more flexible if it were just a UTF-8 encoded JSON object. Furthermore, since these are std::string objects in the C++ implementation they are "b strings" (bytes) objects in Python.

Pyarrow currently stores some of its own information in the metadata field. It has two built-in keys: b'ARROW:schema' and b'pandas'. In the pandas case the value is a JSON object encoded with UTF-8. This allows for namespacing: the "pandas" schema can have as many fields as it needs and they are all namespaced under "pandas". Pyarrow uses the "pandas" schema to store information about what kind of index the table has as well as what type of encoding a column uses (when there is more than one possible pandas encoding for a given data type). I am uncertain what b'ARROW:schema' represents. It appears to be encoded in some way I don't recognize and I have not really played around with it. I assume it's intended to record similar things to the "pandas" schema.

The last thing we need to know to answer your question is that all pyarrow objects are immutable, so there is no way to simply add fields to the schema. Pyarrow does have the schema utility method with_metadata, which returns a clone of a schema object with your own metadata, but this replaces the existing metadata and does not append to it. There is also the experimental method replace_schema_metadata on the Table object, but this also replaces and does not update. So if you want to keep the existing metadata you have to do some more work. Putting this all together we get...

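# table is an existing pyarrow.Table
custom_metadata = {'Sample Number': '12', 'Date Obtained': 'Tuesday'}
# schema.metadata is None when the table carries no metadata yet
existing_metadata = table.schema.metadata or {}
# existing keys win on a name clash because they are unpacked last
merged_metadata = { **custom_metadata, **existing_metadata }
fixed_table = table.replace_schema_metadata(merged_metadata)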

Once this table is saved as a parquet file it will include the key/value metadata fields (at the file level) for Sample Number and Date Obtained.

Also, note that the replace_schema_metadata and with_metadata methods are tolerant of taking in regular python strings (like in my example). However, they will convert these to "b strings", so if you want to access fields in the schema you must use the "b string". For example, if you had just read in a table and wanted to get the sample number you must use table.schema.metadata[b'Sample Number']; table.schema.metadata['Sample Number'] will give you a KeyError.

As you start to use this you may realize it is a pain to constantly have to be mapping Sample Number back and forth to an integer. Furthermore, if your metadata is represented in your application as a large nested object it can be a pain to map this object to a collection of string/string pairs. Also, it's a pain to constantly be remembering the "b string" keys. The solution is to do the same thing the pandas schema does. First convert your metadata to a JSON object. Then convert the JSON object to a "b string".

import json

custom_metadata_json = {'Sample Number': 12, 'Date Obtained': 'Tuesday'}
custom_metadata_bytes = json.dumps(custom_metadata_json).encode('utf8')
existing_metadata = table.schema.metadata or {}  # None when no metadata is set
merged_metadata = { **{'Record Metadata': custom_metadata_bytes}, **existing_metadata }
fixed_table = table.replace_schema_metadata(merged_metadata)

Now you can have as many metadata fields as you want, nested in any way you want, using any of the standard JSON types and it will all be namespaced into a single key/value pair (in this case named "Record Metadata").

Pace answered Oct 02 '22 16:10


This example shows how to create a Parquet file with file metadata and column metadata with PyArrow.

Suppose you have the following CSV data:

movie,release_year
three idiots,2009
her,2013

Read the CSV into a PyArrow table and define a custom schema with column / file metadata:

import pyarrow.csv as pv
import pyarrow.parquet as pq
import pyarrow as pa

table = pv.read_csv('movies.csv')

my_schema = pa.schema([
    pa.field("movie", "string", False, metadata={"spanish": "pelicula"}),
    pa.field("release_year", "int64", True, metadata={"portuguese": "ano"})],
    metadata={"great_music": "reggaeton"})

Create a new table with my_schema and write it out as a Parquet file:

t2 = table.cast(my_schema)

pq.write_table(t2, 'movies.parquet')

Read the Parquet file and fetch the file metadata:

s = pq.read_table('movies.parquet').schema

s.metadata # => {b'great_music': b'reggaeton'}
s.metadata[b'great_music'] # => b'reggaeton'

Fetch the metadata associated with the release_year column:

s.field('release_year').metadata[b'portuguese'] # => b'ano'
Powers answered Oct 02 '22 15:10