How to write Parquet metadata with pyarrow?

Tags: python, parquet, pyarrow

I use pyarrow to create and analyse Parquet tables with biological information and I need to store some metadata, e.g. which sample the data comes from, how it was obtained and processed.

Parquet seems to support file-wide metadata, but I cannot find how to write it via pyarrow. The closest thing I could find is how to write row-group metadata, but this seems like overkill, since my metadata is the same for all row groups in the file.

Is there any way to write file-wide Parquet metadata with pyarrow?

asked Aug 31 '18 21:08

golobor

2 Answers

Pyarrow maps the file-wide metadata to a field in the table's schema named metadata. Regrettably there is not (yet) documentation on this.

Both the Parquet metadata format and the Pyarrow metadata format represent metadata as a collection of key/value pairs where both key & value must be strings. This is unfortunate as it would be more flexible if it were just a UTF-8 encoded JSON object. Furthermore, since these are std::string objects in the C++ implementation they are "b strings" (bytes) objects in Python.

Pyarrow currently stores some of its own information in the metadata field. It has one built-in key b'ARROW:schema' and another built-in key b'pandas'. In the pandas case the value is a JSON object encoded with UTF-8. This allows for namespacing: the "pandas" schema can have as many fields as it needs and they are all namespaced under "pandas". Pyarrow uses the "pandas" schema to store information about what kind of index the table has as well as what type of encoding a column uses (when there is more than one possible pandas encoding for a given data type). I am uncertain what b'ARROW:schema' represents. It appears to be encoded in some way I don't recognize and I have not really played around with it. I assume it's intended to record similar things to the "pandas" schema.

The last thing we need to know to answer your question is that all pyarrow objects are immutable, so there is no way to simply add fields to the schema. Pyarrow does have the schema utility method with_metadata, which returns a clone of a schema object with your own metadata, but this replaces the existing metadata and does not append to it. There is also the experimental method replace_schema_metadata on the Table object, but this also replaces and does not update. So if you want to keep the existing metadata you have to do some more work. Putting this all together we get...

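custom_metadata = {'Sample Number': '12', 'Date Obtained': 'Tuesday'}
# schema.metadata is None when the table carries no metadata yet
existing_metadata = table.schema.metadata or {}
# existing keys win on conflict because they are unpacked last
merged_metadata = { **custom_metadata, **existing_metadata }
fixed_table = table.replace_schema_metadata(merged_metadata)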

Once this table is saved as a parquet file it will include the key/value metadata fields (at the file level) for Sample Number and Date Obtained.

Also, note that the replace_schema_metadata and with_metadata methods tolerate regular Python strings (like in my example). However, they convert these to "b strings", so if you want to access fields in the schema you must use the "b string". For example, if you had just read in a table and wanted to get the sample number you must use table.schema.metadata[b'Sample Number']; table.schema.metadata['Sample Number'] will give you a KeyError.

As you start to use this you may realize it is a pain to constantly have to be mapping Sample Number back and forth to an integer. Furthermore, if your metadata is represented in your application as a large nested object it can be a pain to map this object to a collection of string/string pairs. Also, it's a pain to constantly be remembering the "b string" keys. The solution is to do the same thing the pandas schema does. First convert your metadata to a JSON object. Then convert the JSON object to a "b string".

import json

custom_metadata_json = {'Sample Number': 12, 'Date Obtained': 'Tuesday'}
custom_metadata_bytes = json.dumps(custom_metadata_json).encode('utf8')
existing_metadata = table.schema.metadata or {}  # None when no metadata is set
merged_metadata = { **{'Record Metadata': custom_metadata_bytes}, **existing_metadata }
fixed_table = table.replace_schema_metadata(merged_metadata)

Now you can have as many metadata fields as you want, nested in any way you want, using any of the standard JSON types and it will all be namespaced into a single key/value pair (in this case named "Record Metadata").

answered Oct 02 '22 16:10

Pace

This example shows how to create a Parquet file with file metadata and column metadata with PyArrow.

Suppose you have the following CSV data:

movie,release_year
three idiots,2009
her,2013

Read the CSV into a PyArrow table and define a custom schema with column / file metadata:

import pyarrow.csv as pv
import pyarrow.parquet as pq
import pyarrow as pa

table = pv.read_csv('movies.csv')

my_schema = pa.schema([
    pa.field("movie", "string", False, metadata={"spanish": "pelicula"}),
    pa.field("release_year", "int64", True, metadata={"portuguese": "ano"})],
    metadata={"great_music": "reggaeton"})

Create a new table with my_schema and write it out as a Parquet file:

t2 = table.cast(my_schema)

pq.write_table(t2, 'movies.parquet')

Read the Parquet file and fetch the file metadata:

s = pq.read_table('movies.parquet').schema

s.metadata # => {b'great_music': b'reggaeton'}
s.metadata[b'great_music'] # => b'reggaeton'

Fetch the metadata associated with the release_year column:

s.field('release_year').metadata[b'portuguese'] # => b'ano'

answered Oct 02 '22 15:10

Powers
