I can download a single snappy.parquet partition file with:
aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet
And then use:
parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet
But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file. Instead the prefix: "s3://bucket/my-data.parquet"
Also what if the schema is different in different row groups across different partition files?
Following instructions here I downloaded a jar file and ran
hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/
But this resulted in error: No FileSystem for schema "s3".
This answer seems promising, but only for reading from HDFS. Any solution for S3?
I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.
You should be able to do:
pip install "clidb[extras]"
clidb s3://bucket/
and then click to load parquet files as views to inspect and run SQL against.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With