I've been reading a few questions on this topic, as well as several forums, and in all of them people seem to say that each resulting .parquet file coming out of Spark should be either 64 MB or 1 GB in size, but I still can't figure out which scenarios each of those file sizes applies to, or the reasons behind them, apart from HDFS splitting files into 64 MB blocks.
My current testing scenario is the following:
dataset
  .coalesce(n) // 'n' being 4 or 48 - reasons explained below
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)
I'm currently handling a total of 2.5 GB to 3 GB of daily data that will be split and saved into daily buckets per year. The reason 'n' is 4 or 48 is just for testing purposes: since I know the size of my test set in advance, I try to get a number of files as close to 64 MB or 1 GB each as I can. I haven't implemented any code to buffer the needed data until I hit the exact size I want before saving.
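Roughly, the way I pick 'n' by hand is something like this (just a sketch with my test-set figures, not actual code I run):

// sketch: derive 'n' from the (known) input size and the target output file size
val totalSizeBytes  = 3L * 1024 * 1024 * 1024       // ~3 GB of daily data, known in advance
val targetFileBytes = 64L * 1024 * 1024             // aiming at ~64 MB files (or 1 GB in the other test)
val n = math.ceil(totalSizeBytes.toDouble / targetFileBytes).toInt // ~48 here, ~3-4 when targeting 1 GB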
So my question here is...
Should I take file size into account that much if I'm not planning to use HDFS, and will merely store and retrieve the data from S3?
And also, what would be the optimal size for daily datasets of around 10 GB maximum, if I'm planning to use HDFS to store the resulting .parquet files?
Any other optimization tip would be really appreciated!
The official Parquet documentation recommends a disk block/row group/file size of 512 to 1024 MB on HDFS.
There is no huge direct processing penalty either way, but on the flip side readers lose the opportunity to take advantage of larger, more optimal row groups when your Parquet files are small or tiny, because a row group cannot span multiple Parquet files.
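If you want to steer the row-group size from Spark, one way (just a sketch, using the standard parquet-mr parquet.block.size property and assuming a SparkSession named spark; dataset and outputPath are the names from your question) is:

// sketch: ask parquet-mr for ~512 MB row groups before writing
import org.apache.spark.sql.SaveMode

spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 512 * 1024 * 1024)

dataset.write.mode(SaveMode.Append).parquet(outputPath)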
You can also get Amazon S3 inventory reports in CSV, ORC, or Parquet format. Amazon S3 inventory gives you a flat-file list of your objects and their metadata.
Converting our datasets from row-based (CSV) to columnar (parquet) has significantly reduced the file size.
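As a rough illustration (a sketch; the bucket paths are made up), the conversion itself is straightforward in Spark:

// sketch: convert a row-based CSV dataset to columnar Parquet (paths are hypothetical)
import org.apache.spark.sql.SaveMode

spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/raw/csv/")
  .write
  .mode(SaveMode.Overwrite)
  .parquet("s3a://my-bucket/curated/parquet/")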
You can control the split size of Parquet files, provided you save them with a splittable compression codec like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
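For example (a sketch; 128 MB is just an illustrative value), either in code or at submit time:

// sketch: report 128 MB "blocks" for s3a:// files so that input splits line up with that size
spark.sparkContext.hadoopConfiguration.set("fs.s3a.block.size", (128 * 1024 * 1024).toString)

// or when submitting the job:
// spark-submit --conf spark.hadoop.fs.s3a.block.size=134217728 ...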
Smaller split size

More workers can read the same file in parallel, but you pay more scheduling and task-startup overhead, and on the write side you end up with more output files unless you .repartition() first.

Small files vs large files

Small files: you get that small split whether or not you want it, listing large numbers of small objects on S3 is slow, and you can't ask for a larger block size than the file length. On Hadoop 2.8+ you can at least make writing them cheaper by enabling incremental block uploads with spark.hadoop.fs.s3a.fast.upload true.

Personally - and this is opinion, and somewhat benchmark-driven, but not with your queries:

Writing: save to fewer, larger files, compressed with a splittable codec like snappy, and call .repartition() before the write if your job would otherwise produce lots of small files.

Reading: benchmark against your own queries; for columnar formats like Parquet, random IO usually benefits from spark.hadoop.fs.s3a.experimental.input.fadvise random.

See also: Improving Spark Performance with S3/ADLS/WASB
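Putting the S3A-related knobs from this answer together, a sketch could look like the following (the app name and config values are illustrative assumptions; dataset, n, CONSTANTS and outputPath are the names from your question; check the s3a documentation for your Hadoop version):

import org.apache.spark.sql.{SaveMode, SparkSession}

// sketch: S3A tuning for write-then-read workloads on S3
val spark = SparkSession.builder()
  .appName("daily-parquet-export") // hypothetical app name
  .config("spark.hadoop.fs.s3a.fast.upload", "true")                      // incremental block uploads on write
  .config("spark.hadoop.fs.s3a.block.size", (128 * 1024 * 1024).toString) // split size reported to readers
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")     // random IO for columnar reads
  .getOrCreate()

// fewer, larger output files: repartition before the write instead of relying on
// whatever partitioning the upstream stages happened to produce
dataset
  .repartition(n)
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .parquet(outputPath)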