
How to assign arbitrary metadata to pyarrow.Table / Parquet columns

Use-case

I am using Apache Parquet files as a fast IO format for large-ish spatial data that I am working on in Python with GeoPandas. I am storing feature geometries as WKB and would like to record the coordinate reference system (CRS) as metadata associated with the WKB data.

Code problem

I am trying to assign arbitrary metadata to a pyarrow.Field object.

What I've tried

Suppose table is a pyarrow.Table instantiated from df, a pandas.DataFrame:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
        'foo' : [1, 3, 2],
        'bar' : [6, 4, 5]
        })

table = pa.Table.from_pandas(df)

According to the pyarrow docs, column metadata is contained in a field, which belongs to a schema, and optional metadata may be added to a field.

If I try to assign a value to the metadata attribute, it raises an error:

>>> table.schema.field_by_name('foo').metadata = {'crs' : '4283'}
AttributeError: attribute 'metadata' of 'pyarrow.lib.Field' objects is not writable

>>> table.column(0).field.metadata = {'crs' : '4283'}
AttributeError: attribute 'metadata' of 'pyarrow.lib.Field' objects is not writable

If I try to assign a field (with metadata attached via the add_metadata method) back to the table, that also fails:

>>> table.schema.field_by_name('foo') = (
           table.schema.field_by_name('foo').add_metadata({'crs' : '4283'})
           )
SyntaxError: can't assign to function call

>>> table.column(0).field = table.column(0).field.add_metadata({'crs' : '4283'})
AttributeError: attribute 'field' of 'pyarrow.lib.Column' objects is not writable

I have even tried assigning metadata to a pandas.Series object e.g.

df['foo']._metadata.append({'crs' : '4283'})

but this does not appear in the metadata returned by the pandas_metadata property on the table's schema attribute.

Research

On stackoverflow, this question remains unanswered, and this related question concerns Scala, not Python and pyarrow. Elsewhere I have seen metadata associated with a pyarrow.Field object, but only by instantiating pyarrow.Field and pyarrow.Table objects from the ground up.

PS

This is my first time posting to stackoverflow so thanks in advance and apologies for any errors.

asked Apr 06 '19 by d.arcy

2 Answers

"Everything" in Arrow is immutable, so, as you experienced, you cannot simply modify the metadata of any field or schema. The only way to do this is to create a "new" table with the added metadata. The quotation marks are there because this can be done without actually copying the table: behind the scenes it is just moving pointers around. Here is some code showing how to store arbitrary dictionaries (as long as they're json-serializable) in Arrow metadata and how to retrieve them:

import json

import pyarrow as pa


def set_metadata(tbl, col_meta={}, tbl_meta={}):
    """Store table- and column-level metadata as json-encoded byte strings.

    Table-level metadata is stored in the table's schema.
    Column-level metadata is stored in the table columns' fields.

    To update the metadata, first new fields are created for all columns.
    Next a schema is created using the new fields and updated table metadata.
    Finally a new table is created by replacing the old one's schema, but
    without copying any data.

    Args:
        tbl (pyarrow.Table): The table to store metadata in
        col_meta: A json-serializable dictionary with column metadata in the form
            {
                'column_1': {'some': 'data', 'value': 1},
                'column_2': {'more': 'stuff', 'values': [1,2,3]}
            }
        tbl_meta: A json-serializable dictionary with table-level metadata.
    """
    # Create updated column fields with new metadata
    if col_meta or tbl_meta:
        fields = []
        for col in tbl.itercolumns():
            if col.name in col_meta:
                # Get updated column metadata
                metadata = col.field.metadata or {}
                for k, v in col_meta[col.name].items():
                    metadata[k] = json.dumps(v).encode('utf-8')
                # Update field with updated metadata
                fields.append(col.field.add_metadata(metadata))
            else:
                fields.append(col.field)

        # Get updated table metadata (schema.metadata can be None,
        # so fall back to an empty dict)
        tbl_metadata = tbl.schema.metadata or {}
        for k, v in tbl_meta.items():
            tbl_metadata[k] = json.dumps(v).encode('utf-8')

        # Create new schema with updated field metadata and updated table metadata
        schema = pa.schema(fields, metadata=tbl_metadata)

        # With updated schema build new table (shouldn't copy data)
        # tbl = pa.Table.from_batches(tbl.to_batches(), schema)
        tbl = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)

    return tbl


def decode_metadata(metadata):
    """Arrow stores metadata keys and values as bytes.
    We store "arbitrary" data as json-encoded strings (utf-8),
    which are here decoded into normal dict.
    """
    if not metadata:
        # None or {} are not decoded
        return metadata

    decoded = {}
    for k, v in metadata.items():
        key = k.decode('utf-8')
        val = json.loads(v.decode('utf-8'))
        decoded[key] = val
    return decoded


def table_metadata(tbl):
    """Get table metadata as dict."""
    return decode_metadata(tbl.schema.metadata)


def column_metadata(tbl):
    """Get column metadata as dict."""
    return {col.name: decode_metadata(col.field.metadata) for col in tbl.itercolumns()}


def get_metadata(tbl):
    """Get column and table metadata as dicts."""
    return column_metadata(tbl), table_metadata(tbl)

In short, you create new fields with the added metadata, you aggregate the fields into a new schema, and then you create a new table from the existing table and the new schema. It's all a bit long-winded. Ideally, pyarrow would have convenience functions to do this with fewer lines of code, but last I checked this was the only way to do this.

The only other complication is that metadata is stored as bytes in Arrow, so in the above code I store metadata as json-serializable dictionaries, which I encode in utf-8.

answered Oct 08 '22 by thomas

Here's a less complex way to solve this:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
        'foo' : [1, 3, 2],
        'bar' : [6, 4, 5]
        })

table = pa.Table.from_pandas(df)

your_schema = pa.schema([
    pa.field("foo", "int64", False, metadata={"crs": "4283"}),
    pa.field("bar", "int64", True)],
    metadata={"diamond": "under_pressure"})

table2 = table.cast(your_schema)

table2.field('foo').metadata[b'crs'] # => b'4283'

I also added a schema metadata field to show how that works.

table2.schema.metadata[b'diamond'] # => b'under_pressure'

Notice that the metadata keys / values are byte strings - that's why it's b'under_pressure' instead of 'under_pressure'. Arrow stores schema metadata as raw binary key/value pairs, so everything comes back as bytes.

answered Oct 08 '22 by Powers