 

How do I get schema / column names from parquet file?

I have a file stored in HDFS as part-m-00000.gz.parquet

I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but the output is compressed. I then ran gunzip part-m-00000.gz.parquet, but it doesn't decompress the file since it doesn't recognise the .parquet extension.
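(For context: the gzip compression in a .gz.parquet file is applied internally, per data page, not to the file as a whole, which is why gunzip refuses it. A quick stdlib-only sanity check is to look for Parquet's 4-byte magic "PAR1", which the format places at both the start and the end of every file. This is just an illustrative sketch, not part of the original question:)

```python
def looks_like_parquet(path):
    """Return True if the file starts and ends with Parquet's "PAR1" magic.

    Parquet files carry this magic at both ends regardless of the internal
    per-page compression (gzip, snappy, ...), which is why running gunzip
    on the whole file fails.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # seek to the last 4 bytes
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```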

How do I get the schema / column names for this file?

Super_John asked Nov 24 '15

People also ask

Does parquet file have schema?

Overall, Parquet's features of storing data in columnar format together with schema and typed data allow efficient use for analytical purposes.

Does Parquet allow schema evolution?

Like Protocol Buffers, Avro, and Thrift, Parquet supports schema evolution (schema merging). Users can start with a simple schema and gradually add more columns as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas.


1 Answer

You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet files are laid out on disk very differently from text files.

For exactly this task, the Parquet project provides parquet-tools, which lets you inspect a Parquet file's schema, data, metadata, and more.

Check out the parquet-tools project (which is, put simply, a jar file).

Cloudera, which supports and contributes heavily to Parquet, also has a nice page with examples of parquet-tools usage. The example from that page matching your use case is:

parquet-tools schema part-m-00000.parquet 

Check out the Cloudera page: Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce.

Urvishsinh Mahida answered Sep 21 '22