
Parquet version used to write a file

Is there a way to find out what Parquet version was used to write a Parquet file in HDFS? I'm trying to see whether various files were written with the same Parquet version or with different versions.

asked Nov 18 '15 by lightweight

People also ask

What is the Parquet file format?

Parquet is an open-source file format built for flat, columnar storage of data. Parquet handles complex data in large volumes well. It is known both for its performant data compression and for its ability to handle a wide variety of encoding types.

How do I write a file in PySpark?

In PySpark you can save (write/extract) a DataFrame to a CSV file on disk with dataframeObj.write.csv("path"); the same writer API can also target AWS S3, Azure Blob Storage, HDFS, or any other PySpark-supported file system.
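As a minimal sketch of that writer API, assuming a local SparkSession; the paths and column names below are illustrative, not from the question:

from pyspark.sql import SparkSession

# Assumes a local Spark installation; paths and columns are illustrative.
spark = SparkSession.builder.appName("write-example").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice"), (2, "bob")],
    ["id", "name"],
)

# The same writer API targets CSV (row-based) or Parquet (columnar).
df.write.mode("overwrite").csv("/tmp/people_csv")
df.write.mode("overwrite").parquet("/tmp/people_parquet")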

Why do we use the Parquet file format?

Using Parquet files lets you fetch only the required columns and their values, load those into memory, and answer the query. With a row-based file format like CSV, the entire table would have to be loaded into memory, resulting in increased I/O and worse performance.
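A quick sketch of that column-pruning benefit, reusing the illustrative Parquet path from the example above; Spark reads only the selected column's data from disk:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-example").getOrCreate()

# Only the "name" column is read from the Parquet files,
# not the whole table.
names = spark.read.parquet("/tmp/people_parquet").select("name")
names.show()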


1 Answer

The meta command of parquet-tools prints the file footer, whose creator field records the version of the writer:

$ hadoop jar parquet-tools-1.9.0.jar meta my-parquet-file.parquet | grep "parquet-mr version"

creator:                     parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
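If Python is available, the same creator string can also be read with pyarrow; a sketch, not part of the original answer, and the file path is illustrative (the file would need to be local or on a pyarrow-supported filesystem):

import pyarrow.parquet as pq

# FileMetaData.created_by holds the writer string, e.g.
# "parquet-mr version 1.8.1 (build 4aba4dae...)".
meta = pq.ParquetFile("my-parquet-file.parquet").metadata
print(meta.created_by)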
answered Oct 04 '22 by Jimmy Da