Is gzipped Parquet file splittable in HDFS for Spark?

I get conflicting information when searching and reading answers on the internet on this subject. Can anyone share their experience? I know for a fact that gzipped CSV is not splittable, but maybe the internal structure of Parquet files makes this a totally different case from CSV?

323

asked Apr 10 '17 13:04

YuGagarin

1 Answer

Parquet files with GZIP compression are in fact splittable. This follows from the internal layout of Parquet files, which are always splittable regardless of the compression algorithm used.

This is mainly due to the design of Parquet files, which are divided into the following parts:

Each Parquet file consists of one or more RowGroups; ideally these are about the same size as your HDFS block size.
Each RowGroup consists of one ColumnChunk per column. Every ColumnChunk in a RowGroup has the same number of rows.
ColumnChunks are split into Pages, typically somewhere between 64 KiB and 16 MiB in size. Compression is applied per page, so the gzip stream never spans the whole file and a page is the smallest unit that can be decompressed on its own. In practice, readers such as Spark usually assign work per RowGroup.
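To illustrate why per-page compression matters, here is a toy sketch using only the Python standard library. It is not the Parquet format: the `pages`, `index`, `read_page` and `can_start_mid_stream` names are made up for this example, and the `index` list only plays the role of the offsets a real reader would get from Parquet metadata. It compares separately gzipped chunks with one gzip stream over everything, which is the gzipped CSV situation.

```python
import gzip
import zlib

# Four fake "pages" of CSV-like text (hypothetical data for this sketch).
pages = [
    "".join(f"{i},{j}\n" for j in range(1000)).encode()
    for i in range(4)
]

# Variant A: gzip every page separately and record where each one lives.
blob = b""
index = []  # (offset, length) of each compressed page
for page in pages:
    compressed = gzip.compress(page)
    index.append((len(blob), len(compressed)))
    blob += compressed


def read_page(n):
    """Decompress page n from its own byte range, ignoring all other pages."""
    offset, length = index[n]
    return gzip.decompress(blob[offset:offset + length])


# Variant B: a single gzip stream over all pages, like a gzipped CSV file.
whole = gzip.compress(b"".join(pages))


def can_start_mid_stream(data, offset):
    """Return True if decompression can begin at an arbitrary byte offset."""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


print(read_page(2) == pages[2])
print(can_start_mid_stream(whole, len(whole) // 2))
```

In variant A a worker can jump straight to any page, while in variant B a reader has to start at the beginning of the stream. That difference is the reason the compression codec does not affect whether a Parquet file can be split.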

You can find a more detailed explanation here: https://github.com/apache/parquet-format#file-format

answered Nov 20 '22 06:11

Uwe L. Korn

Tags: gzip, apache-spark, parquet

YuGagarin

People also ask

1 Answers

Uwe L. Korn

Recent Activity

Donate For Us