 

Is it better to have one large parquet file or lots of smaller parquet files?


I understand HDFS will split files into something like 64MB chunks. We have data coming in via streaming, and we can store it in large files or medium-sized files. What is the optimum size for columnar file storage? If I can store files so that the smallest column is 64MB, would it save any computation time over having, say, 1GB files?

asked Mar 21 '17 by ForeverConfused

People also ask

How large should Parquet files be?

The official Parquet documentation recommends a disk block/row group/file size of 512 to 1024 MB on HDFS.

Does Parquet reduce file size?

One of the benefits of using Parquet is small file size. This is important when dealing with large data sets, especially once you start incorporating the cost of cloud storage. Reduced file size is achieved via, among other things, file compression.

How do I reduce the number of Parquet files?

You can do "SET PARQUET_FILE_SIZE=XX" (an Impala query option) to fine-tune that maximum file size up or down until the data splits into the number of files you want (it will take some trial and error, because this is an upper bound; in my experience files end up quite a bit smaller than the limit).


2 Answers

Aim for around 1GB per file (Spark partition) (1).

Ideally, you would use snappy compression (the default), since snappy-compressed parquet files are splittable (2).

Using snappy instead of gzip will significantly increase the file size, so if storage space is an issue, that needs to be considered.

To override the default snappy compression, use .option("compression", "gzip").

If you need to resize/repartition your Dataset/DataFrame/RDD, call .coalesce(<num_partitions>) or, worst case, .repartition(<num_partitions>). Warning: repartition especially, but also coalesce, can cause a reshuffle of the data, so use them with some caution.
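As a rough sketch of how those pieces fit together in PySpark (the paths, bucket name, and partition count below are hypothetical placeholders, not from this thread):

    # Minimal PySpark sketch; source/destination paths and the coalesce
    # count are made-up -- tune the partition count so files land near ~1GB.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3://my-bucket/raw/")       # hypothetical input

    (df.coalesce(8)                                      # fewer, larger output files
       .write
       .mode("overwrite")
       .option("compression", "gzip")                    # override the snappy default
       .parquet("s3://my-bucket/compacted/"))            # hypothetical output

Coalesce is the cheaper choice here because it only merges existing partitions; repartition triggers a full shuffle.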

Also, parquet files (and, for that matter, files in general) should be larger than the HDFS block size (128MB by default).

(1) https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html
(2) http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/

answered Oct 24 '22 by Garren S

Notice that Parquet files are internally split into row groups.

[figure: Parquet file layout]

https://parquet.apache.org/documentation/latest/

So by making parquet files larger, the row groups can stay the same, provided your baseline parquet files were not small/tiny to begin with. There is no big direct penalty on processing; on the contrary, there are more opportunities for readers to take advantage of larger/more optimal row groups if your parquet files were previously small/tiny, since row groups can't span multiple parquet files.
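As an illustration, here is a hedged sketch of compacting a directory of small parquet files into one larger file with bigger row groups using pyarrow (the paths and the row-group size are assumptions, not from the answer):

    # Assumes pyarrow is installed; paths are hypothetical placeholders.
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Read a directory of small parquet files as one logical dataset...
    dataset = ds.dataset("small_files/", format="parquet")
    table = dataset.to_table()

    # ...and rewrite it as a single larger file with bigger row groups.
    # row_group_size is counted in rows, not bytes; pick it for your data.
    pq.write_table(table, "compacted.parquet", row_group_size=1_000_000)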

Also larger parquet files don't limit parallelism of readers, as each parquet file can be broken up logically into multiple splits (consisting of one or more row groups).

The only downside of larger parquet files is that it takes more memory to create them, so watch whether you need to bump up the Spark executors' memory.

Row groups are a way for Parquet files to have horizontal partitioning. Each row group has many column chunks (one for each column), which is how Parquet provides vertical partitioning within the dataset.
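You can see that layout for yourself with a small pyarrow sketch (the file name is a hypothetical placeholder); it lists the row groups in a file and the column chunks inside the first one:

    # Assumes pyarrow is installed; "compacted.parquet" is a placeholder.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("compacted.parquet")
    meta = pf.metadata
    print("row groups:", meta.num_row_groups)

    rg = meta.row_group(0)                       # first row group's metadata
    print("rows:", rg.num_rows, "bytes:", rg.total_byte_size)

    for i in range(rg.num_columns):              # one column chunk per column
        col = rg.column(i)
        print(col.path_in_schema, col.compression, col.total_compressed_size)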

answered Oct 24 '22 by Tagar