I have a file stored in HDFS as part-m-00000.gz.parquet.
I tried running hdfs dfs -text dir/part-m-00000.gz.parquet, but the file is compressed, so I ran gunzip part-m-00000.gz.parquet.
That doesn't uncompress the file, since gunzip doesn't recognise the .parquet extension.
How do I get the schema / column names for this file?
Overall, because Parquet stores data in a columnar format together with its schema and typed values, it is well suited to analytical workloads.
Schema Merging: Like Protocol Buffers, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema and gradually add more columns to it as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas.
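To make "different but mutually compatible schemas" concrete, here is a minimal conceptual sketch in plain Python. It assumes a schema can be modelled as a mapping from column name to type name; the `merge_schemas` helper and the column names are hypothetical, and this is not how Parquet or any particular reader implements merging.

```python
def merge_schemas(*schemas):
    """Union several column->type mappings (hypothetical helper, illustration only).

    Columns seen in any schema are kept; the same column appearing with
    two different types is treated as incompatible.
    """
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                raise ValueError(
                    f"column {name!r} has conflicting types: "
                    f"{merged[name]} vs {dtype}"
                )
            merged.setdefault(name, dtype)
    return merged


# An older file with one column and a newer file that added a column.
old_file_schema = {"id": "int64"}
new_file_schema = {"id": "int64", "name": "string"}
combined = merge_schemas(old_file_schema, new_file_schema)
```

Under this simplified model, the two files above are compatible because the shared column `id` has the same type in both, so the combined view has both `id` and `name`.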
You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet files are written to disk very differently from text files.
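One way to see the difference is to look at the file's magic bytes. A gzip stream starts with the bytes 0x1f 0x8b, while a Parquet file starts and ends with the 4-byte marker PAR1, with a 4-byte little-endian footer length just before the trailing marker. The sketch below uses only the Python standard library; `describe_file` is a hypothetical helper, and it assumes you have first copied the file locally (for example with hdfs dfs -get).

```python
import struct

PARQUET_MAGIC = b"PAR1"
GZIP_MAGIC = b"\x1f\x8b"


def describe_file(path):
    """Classify a local file by its magic bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)              # jump to the end to learn the file size
        size = f.tell()
        tail = b""
        if size >= 12:            # leading magic + footer length + trailing magic
            f.seek(-8, 2)
            tail = f.read(8)
    if head[:2] == GZIP_MAGIC:
        return {"kind": "gzip"}
    if head == PARQUET_MAGIC and tail[4:] == PARQUET_MAGIC:
        # The 4 bytes before the trailing magic hold the footer length.
        (footer_length,) = struct.unpack("<I", tail[:4])
        return {"kind": "parquet", "footer_length": footer_length}
    return {"kind": "unknown"}
```

If this reports "parquet" for part-m-00000.gz.parquet, the .gz in the name most likely refers to the codec used for the data inside the file rather than to gzip applied to the whole file, which would explain why gunzip cannot help here.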
For tasks like the one you are attempting, the Parquet project provides parquet-tools, which can display a file's schema, data, and metadata.
Check out the parquet-tools project (which is, put simply, a jar file).
Cloudera, which supports and contributes heavily to Parquet, also has a nice page with examples of parquet-tools usage. An example from that page for your use case is:
parquet-tools schema part-m-00000.parquet
Check out the Cloudera page: Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce.