I have around 100 GB of data per day that I write to S3 using Spark, in Parquet format. The application that writes this data runs Spark 2.3.
The 100 GB of data is further partitioned, and the largest partition is 30 GB. For this question, let's just consider that 30 GB partition.
We are planning to migrate this whole dataset and rewrite it to S3 using Spark 2.4. Initially we didn't give much thought to file size or block size when writing to S3. Now that we are going to rewrite everything, we want to choose the optimal file size and Parquet block size.
Before talking about the Parquet side of the equation, one thing to consider is how the data will be used after you save it to Parquet. If it's going to be read/processed often, you may want to consider what the access patterns are and partition the data accordingly. One common pattern is partitioning by date, because most of our queries have a time range. Partitioning your data appropriately will have a much bigger impact on performance when that data is used after it is written.
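For example, here is a minimal sketch of a date-partitioned write; the bucket, paths, and `event_date` column are hypothetical, not taken from the question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("daily-write").getOrCreate()

// Hypothetical input: a DataFrame with an `event_date` column to use as the partition key.
val events = spark.read.parquet("s3a://my-bucket/raw/events/")

events.write
  .partitionBy("event_date")                  // creates event_date=YYYY-MM-DD/ directories
  .mode("overwrite")
  .parquet("s3a://my-bucket/curated/events/")
```

Readers that filter on `event_date` then only touch the matching directories (partition pruning), which is what makes this layout pay off downstream.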
Now, on to Parquet: the rule of thumb is for the Parquet block size to be roughly the block size of the underlying file system. That matters when you're using HDFS, but it doesn't matter much when you're using S3.
Again, the main consideration for the Parquet block size is how you're going to read the data. Since a Parquet block basically has to be reconstructed in memory, the larger it is, the more memory is needed downstream. On the other hand, you'll also need fewer workers, so if your downstream workers have plenty of memory, you can use larger Parquet blocks, as it will be slightly more efficient.
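If you do decide to tune it, one way is to set the standard Parquet property `parquet.block.size` (the row-group size) on the Hadoop configuration before writing. A rough sketch, with an illustrative value rather than a recommendation:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("row-group-size").getOrCreate()

// The Parquet writer picks up the row-group ("block") size from the Hadoop configuration.
// 128 MB here is purely illustrative, not a recommendation from this answer.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

// Hypothetical DataFrame and output path.
val df = spark.read.parquet("s3a://my-bucket/raw/events/")
df.write.mode("overwrite").parquet("s3a://my-bucket/curated/events/")
```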
However, for better scalability it's usually better to have several smaller objects, especially laid out according to some partitioning scheme, than one large object, which may become a performance bottleneck, depending on your use case.
To sum it up: if you write that 30 GB partition as a single S3 object with a 512 MB Parquet block size, then, because Parquet is splittable and Spark relies on getSplits() to plan its input, the first step in your Spark job will have 60 tasks. They will use byte-range fetches to read different parts of the same S3 object in parallel. However, you'll get better performance if you break the data down into several smaller (preferably partitioned) S3 objects, since they can be written in parallel (one large file has to be written sequentially) and will also most likely be read faster when accessed by a large number of readers.
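A rough sketch of the "several smaller objects" approach is to repartition before writing so that each output file lands near a target size. The 60-file / ~512 MB figures and the paths below are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("resize-output").getOrCreate()

// Hypothetical input: the ~30 GB daily partition from the question.
val df = spark.read.parquet("s3a://my-bucket/raw/events/2021-01-01/")

// Aim for roughly 60 output objects of ~512 MB each (both figures are illustrative).
df.repartition(60)
  .write
  .mode("overwrite")
  .parquet("s3a://my-bucket/curated/events/2021-01-01/")
```

Each of the 60 write tasks produces its own S3 object, so the uploads happen in parallel, and downstream readers get 60 independently fetchable objects instead of a single 30 GB file.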