How do I get schema / column names from parquet file?

Tags:

hadoop

apache-pig

hdfs

parquet

I have a file stored in HDFS as part-m-00000.gz.parquet

I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet but it's compressed, so I ran gunzip part-m-00000.gz.parquet but it doesn't uncompress the file since it doesn't recognise the .parquet extension.

How do I get the schema / column names for this file?

489

asked Nov 24 '15 00:11

Super_John

1 Answer

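You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet is a binary, columnar format, so its files are laid out on disk very differently from text files. The .gz in the file name only indicates that the data inside the Parquet file was compressed with the gzip codec; the file as a whole is not a gzip archive, which is why gunzip can't unpack it.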
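To make the point above concrete, here is a minimal sketch (not part of parquet-tools) that inspects a local copy of the file using only the Python standard library. It relies on two facts about the formats: a Parquet file starts and ends with the 4-byte marker PAR1, with a 4-byte little-endian footer length just before the trailing marker, and a gzip stream starts with the bytes 1f 8b. The function names sniff_format and footer_length are made up for illustration.

```python
import struct

PARQUET_MAGIC = b"PAR1"   # 4-byte marker at both ends of a Parquet file
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def sniff_format(path):
    """Return 'parquet', 'gzip', or 'unknown' based on the file's magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)          # jump to the end to learn the file size
        size = f.tell()
        tail = b""
        if size >= 8:
            f.seek(-4, 2)     # last 4 bytes of the file
            tail = f.read(4)
    if head == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "parquet"
    if head[:2] == GZIP_MAGIC:
        return "gzip"
    return "unknown"


def footer_length(path):
    """Read the 4-byte little-endian metadata length stored before the trailing magic."""
    with open(path, "rb") as f:
        f.seek(-8, 2)         # 4 bytes of length + 4 bytes of trailing magic
        (length,) = struct.unpack("<I", f.read(4))
    return length
```

Run against a local copy of part-m-00000.gz.parquet (for example, one fetched with hdfs dfs -get), sniff_format should report 'parquet' rather than 'gzip'. The schema itself lives in the footer metadata, which is what a Parquet-aware tool decodes for you.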

The Parquet project provides parquet-tools for exactly this kind of task: viewing a file's schema, data, metadata, and so on.

Check out the parquet-tools project, which is, put simply, a jar file.

Cloudera, which supports and contributes heavily to Parquet, also has a nice page with examples of parquet-tools usage. An example from that page for your use case is

parquet-tools schema part-m-00000.parquet

Check out the Cloudera page: Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce

answered Sep 21 '22 16:09

Urvishsinh Mahida
