How do I store custom metadata in a ParquetDataset using pyarrow?
For example, if I create a Parquet dataset using Dask:
import dask
dask.datasets.timeseries().to_parquet('temp.parq')
I can then read it using pyarrow:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('temp.parq')
However, the same method I would use for writing metadata for a single Parquet file (outlined in How to write Parquet metadata with pyarrow?) does not work for a ParquetDataset, since there is no replace_schema_metadata function or similar.
I think I would probably like to write a custom _common_metadata file, as the metadata I'd like to store pertains to the whole dataset. I imagine the procedure would be something like:
meta = pq.read_metadata('temp.parq/_common_metadata')
custom_metadata = { b'type': b'mydataset' }
merged_metadata = { **custom_metadata, **meta.metadata }
# TODO: Construct FileMetaData object with merged_metadata
new_meta.write_metadata_file('temp.parq/_common_metadata')
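One way that TODO might be filled in is to work with the Arrow schema rather than the FileMetaData object, since Schema.with_metadata and pq.write_metadata together can produce a new _common_metadata file. A sketch, not a confirmed recipe:

import pyarrow.parquet as pq

# Read the Arrow schema stored in _common_metadata
schema = pq.read_schema('temp.parq/_common_metadata')

# Merge the custom entries with any existing key/value metadata
custom_metadata = {b'type': b'mydataset'}
merged_metadata = {**custom_metadata, **(schema.metadata or {})}

# with_metadata returns a new schema carrying the merged metadata;
# write_metadata serializes it back as a footer-only Parquet file
new_schema = schema.with_metadata(merged_metadata)
pq.write_metadata(new_schema, 'temp.parq/_common_metadata')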
In addition to the data itself, the Parquet specification also stores metadata recording the schema at three levels: file, chunk (column), and page header. The footer of each file contains the file-level metadata.
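The file- and chunk-level pieces can be inspected from Python. For instance (assuming a part file named part.0.parquet, which is what Dask produces by default; page headers are not exposed by pyarrow's Python API):

import pyarrow.parquet as pq

# File-level metadata lives in the footer of each part file
meta = pq.read_metadata('temp.parq/part.0.parquet')  # assumed part file name
print(meta.num_rows, meta.num_row_groups)

# Chunk (column) level metadata hangs off each row group
print(meta.row_group(0).column(0).statistics)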
One possibility (that does not directly answer the question) is to use Dask:
import dask
# Sample data
df = dask.datasets.timeseries()
df.to_parquet('test.parq', custom_metadata={'mymeta': 'myvalue'})
Dask does this by writing the metadata to all the files in the directory, including _common_metadata and _metadata.
from pathlib import Path
import pyarrow.parquet as pq

# Check that the custom key appears in every file's key/value metadata
files = Path('test.parq').glob('*')
all(b'mymeta' in pq.ParquetFile(file).metadata.metadata for file in files)
# True
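To read the custom metadata back at the dataset level, opening _common_metadata directly should be enough. Note that Parquet key/value metadata is stored as raw bytes, so keys and values come back as bytes objects:

import pyarrow.parquet as pq

# The dataset-level copy lives in _common_metadata
meta = pq.read_metadata('test.parq/_common_metadata')
print(meta.metadata[b'mymeta'])  # b'myvalue'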