Inspect Parquet in S3 from Command Line

Question

I can download a single snappy.parquet partition file with:

aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet

And then use:

parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet

But I'd rather not download the file, and I'd rather not have to name a particular snappy.parquet file. Instead I'd like to point at the prefix: "s3://bucket/my-data.parquet"

Also, what if the schema differs between row groups in different partition files?

Following the instructions here, I downloaded a jar file and ran:

hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/

But this resulted in the error: No FileSystem for scheme "s3".

This answer seems promising, but it only covers reading from HDFS. Is there a solution for S3?

Danny Boland · Accepted Answer

I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.

You should be able to do:

pip install "clidb[extras]"
clidb s3://bucket/

and then click to load parquet files as views to inspect and run SQL against.
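
For a non-interactive check of whether the schema is the same across partition files, something along these lines might work. It is only a sketch using pyarrow, which is a swapped-in alternative and not part of clidb. The function names are made up for illustration, and the bucket and prefix are the placeholders from the question. It assumes AWS credentials are available through the usual config or environment. A Parquet file stores one schema in its footer, so the row groups inside a single file share it and any drift would show up between files. Reading `physical_schema` should only need each file's footer rather than the whole file.

```python
from collections import defaultdict


def group_files_by_schema(schemas):
    """Group file paths by schema text so drift between files is visible.

    `schemas` maps file path -> schema rendered as a string.
    """
    groups = defaultdict(list)
    for path, schema_text in sorted(schemas.items()):
        groups[schema_text].append(path)
    return dict(groups)


def read_schemas_from_s3(prefix):
    """Collect the footer schema of every Parquet file under an S3 prefix.

    `prefix` is "bucket/key-prefix" without the "s3://" scheme (assumption:
    credentials and region are picked up from the standard AWS settings).
    """
    # Imported lazily so the grouping helper above works without pyarrow.
    import pyarrow.dataset as ds
    from pyarrow import fs

    s3 = fs.S3FileSystem()
    dataset = ds.dataset(prefix, filesystem=s3, format="parquet")
    schemas = {}
    for fragment in dataset.get_fragments():
        # physical_schema comes from that file's own footer metadata.
        schemas[fragment.path] = str(fragment.physical_schema)
    return schemas


# Example usage (placeholder bucket and prefix from the question):
# found = read_schemas_from_s3("bucket/my-data.parquet")
# for schema_text, paths in group_files_by_schema(found).items():
#     print(f"{len(paths)} file(s) share this schema:")
#     print(schema_text)
```

If every file ends up in one group, the schema is consistent across the prefix. More than one group lists which files differ.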

Tags: amazon-s3, parquet

Wassadamo