 

Correct Parquet file size when storing in S3?

I've been reading a few questions on this topic, as well as several forums, and in all of them the advice seems to be that each of the resulting .parquet files coming out of Spark should be either 64 MB or 1 GB in size. I still can't work out which scenarios call for which of those file sizes, or the reasons behind them, apart from HDFS splitting files into 64 MB blocks.

My current testing scenario is the following.

dataset
  .coalesce(n) // 'n' is 4 or 48 - reasons explained below
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)

I'm currently handling a total of 2.5 GB to 3 GB of daily data, which will be split and saved into daily buckets per year. The reason 'n' is 4 or 48 is purely for testing: since I know the size of my test set in advance, I try to get a number of files as close to 64 MB or 1 GB each as I can. I haven't implemented code to buffer the needed data until I hit the exact size I need prior to saving.
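
For illustration only, here is a rough sketch of how 'n' could be derived from an estimated input size instead of being hard-coded (the target size and the size estimate are assumptions; Spark won't tell you the compressed output size up front):

  // Hypothetical sizing helper: 'estimatedBytes' has to come from your own
  // knowledge of the input, Spark cannot report the compressed output size in advance.
  val targetFileBytes = 128L * 1024 * 1024          // e.g. aim for ~128 MB files
  val estimatedBytes  = 3L * 1024 * 1024 * 1024     // ~3 GB of daily data
  val n = math.max(1, math.ceil(estimatedBytes.toDouble / targetFileBytes).toInt)
  // then pass 'n' to .coalesce(n) in the snippet above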

So my question here is...

Should I take the size that much into account if I'm not planning to use HDFS and merely store and retrieve data from S3?

And also, what would be the optimal size for daily datasets of around 10 GB maximum if I'm planning to use HDFS to store the resulting .parquet files?

Any other optimization tip would be really appreciated!

asked Jan 22 '19 by Lenny D.


People also ask

What is the best size for Parquet file?

The official Parquet documentation recommends a disk block/row group/file size of 512 to 1024 MB on HDFS.
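
To apply that recommendation when writing from Spark, one possible sketch (assuming a SparkSession named spark and the standard parquet-mr property name) is:

  // Sketch: raise the Parquet row group size via the parquet-mr property
  // "parquet.block.size" (512 MB here, the lower end of the recommendation).
  spark.sparkContext.hadoopConfiguration
    .setInt("parquet.block.size", 512 * 1024 * 1024)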

Is it better to have one large Parquet file or lots of smaller Parquet files?

There is no huge direct penalty on processing; on the contrary, larger files give readers more opportunities to take advantage of larger, more optimal row groups, since row groups cannot span multiple Parquet files.

Does S3 support Parquet?

Yes. Amazon S3 inventory gives you a flat file list of your objects and metadata, and you can get the inventory reports in CSV, ORC, or Parquet format.

Does Parquet reduce file size?

Converting our datasets from row-based (CSV) to columnar (parquet) has significantly reduced the file size.


1 Answer

You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
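
For example, a minimal sketch of setting that when building the session (the 128 MB value is only an illustration):

  import org.apache.spark.sql.SparkSession

  // Sketch: have the s3a connector report 128 MB blocks, so readers build
  // ~128 MB input splits over the parquet files.
  val spark = SparkSession.builder()
    .appName("parquet-on-s3")                                   // hypothetical app name
    .config("spark.hadoop.fs.s3a.block.size", (128L * 1024 * 1024).toString)
    .getOrCreate()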

Smaller split size

  • More workers can work on a file simultaneously. Speedup if you have idle workers.
  • More startup overhead: scheduling work, starting processing, committing tasks
  • Creates more files from the output, unless you repartition.

Small files vs large files

Small files:

  • you get that small split whether or not you want it.
  • even if you use unsplittable compression.
  • takes longer to list files. Listing directory trees on s3 is very slow
  • impossible to ask for larger block sizes than the file length
  • easier to save if your s3 client doesn't do incremental writes in blocks (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload to true; see the sketch after this list)
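
A minimal sketch of enabling that incremental upload when building the session (the disk buffer choice is an assumption):

  import org.apache.spark.sql.SparkSession

  // Sketch: enable the s3a incremental ("fast") upload so large files are
  // streamed up in blocks rather than buffered whole before the final PUT.
  val spark = SparkSession.builder()
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")   // assumption: buffer blocks on local disk
    .getOrCreate()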

Personally (and this is opinion, backed by some benchmarking, but not with your queries):

Writing

  • save to larger files.
  • with snappy.
  • shallower + wider directory trees over deep and narrow (see the write sketch after this list)
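
Putting those together, a hedged sketch of such a write (the partition column and repartition count are assumptions for illustration):

  import org.apache.spark.sql.SaveMode

  // Sketch: fewer, larger, snappy-compressed files under a shallow partition tree.
  dataset
    .repartition(8)                        // fewer, larger output files; pick to taste
    .write
    .mode(SaveMode.Append)
    .option("compression", "snappy")
    .partitionBy("date")                   // hypothetical column; shallow+wide over deep+narrow
    .parquet(outputPath)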

Reading

  • play with different block sizes; treat 32-64 MB as a minimum
  • on Hadoop 3.1, use the zero-rename committers. Otherwise, switch to the v2 commit algorithm
  • if your FS connector supports it, make sure random IO is turned on (Hadoop 2.8+: set spark.hadoop.fs.s3a.experimental.fadvise to random; see the read sketch after this list)
  • save to larger files via .repartition().
  • Keep an eye on how much data you are collecting, as it is very easy to run up large bills from storing lots of old data.
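
A minimal read-side sketch along those lines (values are illustrative, not tuned for any particular queries):

  import org.apache.spark.sql.SparkSession

  // Sketch: random IO for columnar reads plus a larger reported block size.
  val spark = SparkSession.builder()
    .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")             // Hadoop 2.8+
    .config("spark.hadoop.fs.s3a.block.size", (64L * 1024 * 1024).toString)   // treat 64 MB as a floor
    .getOrCreate()

  val df = spark.read.option("basePath", outputPath).parquet(outputPath)      // outputPath as in the question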

See also: Improving Spark Performance with S3/ADLS/WASB.

answered Oct 09 '22 by stevel