How do I store custom metadata in a ParquetDataset using pyarrow?
For example, if I create a Parquet dataset using Dask:
import dask
dask.datasets.timeseries().to_parquet('temp.parq')
I can then read it using pyarrow:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('temp.parq')
However, the same method I would use for writing metadata for a single Parquet file (outlined in How to write Parquet metadata with pyarrow?) does not work for a ParquetDataset, since there is no replace_schema_metadata function or similar.
I think I would probably like to write a custom _common_metadata file, as the metadata I'd like to store pertains to the whole dataset. I imagine the procedure would be something like:
meta = pq.read_metadata('temp.parq/_common_metadata')
custom_metadata = { b'type': b'mydataset' }
merged_metadata = { **custom_metadata, **meta.metadata }
# TODO: Construct FileMetaData object with merged_metadata
new_meta.write_metadata_file('temp.parq/_common_metadata')
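One way that TODO might be filled in is to work with the Arrow schema rather than the FileMetaData object, since Schema.with_metadata and pq.write_metadata together can produce a new _common_metadata file. A sketch, not a confirmed recipe:

import pyarrow.parquet as pq

# Read the Arrow schema stored in _common_metadata
schema = pq.read_schema('temp.parq/_common_metadata')

# Merge the custom entries with any existing key/value metadata
custom_metadata = {b'type': b'mydataset'}
merged_metadata = {**custom_metadata, **(schema.metadata or {})}

# with_metadata returns a new schema carrying the merged metadata;
# write_metadata serializes it back as a footer-only Parquet file
new_schema = schema.with_metadata(merged_metadata)
pq.write_metadata(new_schema, 'temp.parq/_common_metadata')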
In addition to the data itself, the Parquet specification also stores metadata recording the schema at three levels: file, chunk (column), and page header. The footer of each file contains the file-level metadata.
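The file- and chunk-level pieces can be inspected from Python. For instance (assuming a part file named part.0.parquet, which is what Dask produces by default; page headers are not exposed by pyarrow's Python API):

import pyarrow.parquet as pq

# File-level metadata lives in the footer of each part file
meta = pq.read_metadata('temp.parq/part.0.parquet')  # assumed part file name
print(meta.num_rows, meta.num_row_groups)

# Chunk (column) level metadata hangs off each row group
print(meta.row_group(0).column(0).statistics)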
One possibility (that does not directly answer the question) is to use Dask:
import dask
# Sample data
df = dask.datasets.timeseries()
df.to_parquet('test.parq', custom_metadata={'mymeta': 'myvalue'})
Dask does this by writing the metadata to all the files in the directory, including _common_metadata and _metadata.
from pathlib import Path
import pyarrow.parquet as pq

# Check that the custom key appears in every file's key/value metadata
files = Path('test.parq').glob('*')
all(b'mymeta' in pq.ParquetFile(file).metadata.metadata for file in files)
# True
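To read the custom metadata back at the dataset level, opening _common_metadata directly should be enough. Note that Parquet key/value metadata is stored as raw bytes, so keys and values come back as bytes objects:

import pyarrow.parquet as pq

# The dataset-level copy lives in _common_metadata
meta = pq.read_metadata('test.parq/_common_metadata')
print(meta.metadata[b'mymeta'])  # b'myvalue'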