
Is gzipped Parquet file splittable in HDFS for Spark?

I get confusing messages when searching and reading answers on the internet on this subject. Can anyone share their experience? I know for a fact that gzipped CSV is not splittable, but maybe the internal file structure of Parquet makes it a totally different case for Parquet vs. CSV?

YuGagarin asked Apr 10 '17

People also ask

Are Parquet files Splittable?

Yes, Parquet files are splittable. S3 supports positioned reads (range requests), which can be used to read only selected portions of the input file (object).
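As a rough illustration of what a positioned read looks like, here is a minimal sketch using boto3 (boto3, the bucket, and the key are not from the answer above; they are placeholders for illustration only):

```python
# Hypothetical sketch: a positioned (ranged) read from S3 using boto3.
import boto3

s3 = boto3.client("s3")

# Ask S3 for only bytes 0-1023 of the object instead of the whole file.
# This is the kind of "range request" that lets a reader fetch the Parquet
# footer or a single row group without downloading the entire object.
resp = s3.get_object(
    Bucket="my-bucket",                      # placeholder bucket name
    Key="data/part-00000.parquet",           # placeholder object key
    Range="bytes=0-1023",
)
chunk = resp["Body"].read()
print(len(chunk))  # 1024
```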

Can Parquet files be Gzipped?

Gzip is supported by both Spark and Parquet, but as a compression codec applied inside the file, not by gzipping the finished file as a whole. Apache Parquet is a free and open-source, column-oriented data storage format of the Apache Hadoop ecosystem.
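A minimal PySpark sketch of writing Parquet with the gzip codec (the path and column names are made up for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-parquet-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# The "compression" option selects the codec used inside the Parquet file;
# gzip is applied per page rather than to the file as a whole.
df.write.mode("overwrite") \
    .option("compression", "gzip") \
    .parquet("/tmp/demo_gzip.parquet")   # placeholder output path
```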

How does HDFS store Parquet files?

Each block in a Parquet file is stored in the form of row groups, so the data in a Parquet file is partitioned into multiple row groups. These row groups in turn consist of one or more column chunks, each of which corresponds to a column in the dataset. The data for each column chunk is then written in the form of pages.
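If you want to see this layout for yourself, here is a small sketch using pyarrow (pyarrow and the file path are assumptions, not part of the answer) that walks the row groups and column chunks of an existing Parquet file:

```python
import pyarrow.parquet as pq

# Placeholder path to a single Parquet part file.
pf = pq.ParquetFile("/tmp/demo_gzip.parquet/part-00000.parquet")
meta = pf.metadata

print("row groups:", meta.num_row_groups)
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)  # one column chunk in this row group
        print(rg, chunk.path_in_schema, chunk.compression,
              chunk.total_compressed_size)
```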

Why Parquet is best fit for spark?

Parquet has higher execution speed compared to other standard file formats like Avro and JSON, and it also consumes less disk space than Avro and JSON.


1 Answer

Parquet files with GZIP compression are actually splittable. This is because of the internal layout of Parquet files: they are always splittable, independent of the compression algorithm used.

This is mainly due to the design of Parquet files, which are divided into the following parts:

  1. Each Parquet file consists of several RowGroups; these should be roughly the same size as your HDFS block size.
  2. Each RowGroup consists of one ColumnChunk per column, and all ColumnChunks in a RowGroup contain the same number of rows.
  3. ColumnChunks are split into Pages, typically between 64 KiB and 16 MiB in size. Compression is applied on a per-page basis, so a page is the lowest level of parallelisation a job can work on.

You can find a more detailed explanation here: https://github.com/apache/parquet-format#file-format
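As a quick sanity check of the splittability claim, here is a hedged PySpark sketch (the path is a placeholder, and it assumes a dataset large enough to contain several RowGroups): reading a gzip-compressed Parquet dataset still yields multiple input partitions, because splits follow RowGroup boundaries rather than the compression codec.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-split-check").getOrCreate()

# A reasonably large gzip-compressed Parquet dataset written earlier
# (placeholder path; several RowGroups are assumed).
df = spark.read.parquet("/tmp/big_gzip_dataset.parquet")

# With several RowGroups, Spark creates more than one input partition,
# i.e. the data is read in parallel despite the gzip codec.
print(df.rdd.getNumPartitions())
```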

Uwe L. Korn answered Nov 20 '22