 

Is it better to have one large parquet file or lots of smaller parquet files?


I understand HDFS will split files into something like 64MB chunks. We have data coming in via streaming, and we can store it in large files or medium-sized files. What is the optimum size for columnar file storage? If I can store files so that the smallest column is 64MB, would it save any computation time over having, say, 1GB files?

asked Mar 21 '17 by ForeverConfused

People also ask

How large should Parquet files be?

The official Parquet documentation recommends a disk block/row group/file size of 512 to 1024 MB on HDFS.

Does Parquet reduce file size?

One of the benefits of using Parquet is small file size. This is important when dealing with large data sets, especially once you start incorporating the cost of cloud storage. Reduced file size is achieved via, among other things, file compression.

How do I reduce the number of Parquet files?

You can do "SET PARQUET_FILE_SIZE=XX" (an Impala query option) to fine-tune that maximum file size up or down until the data splits into the number of files you want (it will take some trial and error, because this is an upper bound; in my experience files end up quite a bit smaller than the limit).


2 Answers

Aim for around 1GB per file (Spark partition) (1).

Ideally, you would use snappy compression (the default), since snappy-compressed parquet files are splittable (2).

Using snappy instead of gzip will significantly increase the file size, so if storage space is an issue, that needs to be considered.

To override the default snappy compression, use .option("compression", "gzip").

If you need to resize/repartition your Dataset/DataFrame/RDD, call .coalesce(<num_partitions>) or, worst case, .repartition(<num_partitions>). Warning: repartition especially, but also coalesce, can cause a reshuffle of the data, so use them with some caution.
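As a rough sketch of how those pieces fit together in PySpark (the paths, bucket name, and partition count below are hypothetical placeholders, not from this thread):

    # Minimal PySpark sketch; source/destination paths and the coalesce
    # count are made-up -- tune the partition count so files land near ~1GB.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3://my-bucket/raw/")       # hypothetical input

    (df.coalesce(8)                                      # fewer, larger output files
       .write
       .mode("overwrite")
       .option("compression", "gzip")                    # override the snappy default
       .parquet("s3://my-bucket/compacted/"))            # hypothetical output

Coalesce is the cheaper choice here because it only merges existing partitions; repartition triggers a full shuffle.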

Also, parquet files (and, for that matter, files in general) should be larger than the HDFS block size (128MB by default).

(1) https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html
(2) http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/

answered Oct 24 '22 by Garren S

Notice that Parquet files are internally split into row groups.

[figure: Parquet file layout]

https://parquet.apache.org/documentation/latest/

So by making parquet files larger, the row groups can stay the same, provided your baseline parquet files were not small/tiny to begin with. There is no big direct penalty on processing; on the contrary, there are more opportunities for readers to take advantage of larger/more optimal row groups if your parquet files were previously small/tiny, since row groups can't span multiple parquet files.
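As an illustration, here is a hedged sketch of compacting a directory of small parquet files into one larger file with bigger row groups using pyarrow (the paths and the row-group size are assumptions, not from the answer):

    # Assumes pyarrow is installed; paths are hypothetical placeholders.
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Read a directory of small parquet files as one logical dataset...
    dataset = ds.dataset("small_files/", format="parquet")
    table = dataset.to_table()

    # ...and rewrite it as a single larger file with bigger row groups.
    # row_group_size is counted in rows, not bytes; pick it for your data.
    pq.write_table(table, "compacted.parquet", row_group_size=1_000_000)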

Also larger parquet files don't limit parallelism of readers, as each parquet file can be broken up logically into multiple splits (consisting of one or more row groups).

The only downside of larger parquet files is that it takes more memory to create them, so watch whether you need to bump up the Spark executors' memory.

Row groups are a way for Parquet files to have horizontal partitioning. Each row group has many column chunks (one for each column), which is how Parquet provides vertical partitioning within the dataset.
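You can see that layout for yourself with a small pyarrow sketch (the file name is a hypothetical placeholder); it lists the row groups in a file and the column chunks inside the first one:

    # Assumes pyarrow is installed; "compacted.parquet" is a placeholder.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("compacted.parquet")
    meta = pf.metadata
    print("row groups:", meta.num_row_groups)

    rg = meta.row_group(0)                       # first row group's metadata
    print("rows:", rg.num_rows, "bytes:", rg.total_byte_size)

    for i in range(rg.num_columns):              # one column chunk per column
        col = rg.column(i)
        print(col.path_in_schema, col.compression, col.total_compressed_size)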

answered Oct 24 '22 by Tagar