Will data extracts run quicker if a DataFrame is sorted before being persisted as Parquet files?
Suppose we have the following peopleDf DataFrame (pretend this is a sample and the real one has 20 billion rows):
+-----+----------------+
| age | favorite_color |
+-----+----------------+
| 54 | blue |
| 10 | black |
| 13 | blue |
| 19 | red |
| 89 | blue |
+-----+----------------+
Let's write out sorted and unsorted versions of this DataFrame to Parquet files.
peopleDf.write.parquet("s3a://some-bucket/unsorted/")
peopleDf.sort($"favorite_color").write.parquet("s3a://some-bucket/sorted/")
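For experimenting locally, here is a self-contained sketch of the two writes above (an assumption-laden example: it presumes Spark is on the classpath and swaps the s3a:// bucket paths for local /tmp paths). Note that sort() triggers a range-partitioning shuffle before the write, which is what clusters equal favorite_color values together in the output files.

```scala
// Minimal local-mode sketch; /tmp paths stand in for the illustrative buckets.
import org.apache.spark.sql.SparkSession

object SortedWriteDemo extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("sorted-write-demo")
    .getOrCreate()
  import spark.implicits._

  val peopleDf = Seq(
    (54, "blue"), (10, "black"), (13, "blue"), (19, "red"), (89, "blue")
  ).toDF("age", "favorite_color")

  // Unsorted write: each Parquet row group mixes colors, so its
  // min/max statistics span the whole range of values.
  peopleDf.write.mode("overwrite").parquet("/tmp/unsorted")

  // Sorted write: sort() range-partitions the rows first, so each output
  // file (and its row groups) covers a narrow range of favorite_color.
  peopleDf.sort($"favorite_color").write.mode("overwrite").parquet("/tmp/sorted")

  spark.stop()
}
```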
Are there any performance gains when reading in the sorted data and doing a data extract based on favorite_color?
val pBlue1 = spark.read.parquet("s3a://some-bucket/unsorted/").filter($"favorite_color" === "blue")
// is this faster?
val pBlue2 = spark.read.parquet("s3a://some-bucket/sorted/").filter($"favorite_color" === "blue")
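One way to see what the Parquet reader is given to work with is to inspect the physical plan (a sketch; the exact plan text varies by Spark version). The filter is pushed down in both the sorted and unsorted case; the difference sorting makes shows up at scan time, when row-group statistics are checked.

```scala
pBlue2.explain()
// Look for a PushedFilters entry in the physical plan, similar to:
//   PushedFilters: [IsNotNull(favorite_color), EqualTo(favorite_color,blue)]
// The same pushdown appears for pBlue1; the win from sorting is that the
// row-group min/max statistics become selective enough for the reader to
// skip whole row groups.
```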
Sorting provides a number of benefits:
Parquet stores min/max statistics for each row group. When the data is sorted on favorite_color, rows with equal values are clustered together, so each row group's [min, max] range is narrow and the reader can skip entire row groups that cannot contain "blue". In the unsorted data, every row group is likely to span the full range of colors, so no row groups can be skipped even though the filter is still pushed down.
If you only ever filter on a single column, partitioning the output on that column can be more efficient still and doesn't require a shuffle, although there are some related issues right now.
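The disk-partitioning alternative mentioned above can be sketched like this (same illustrative bucket, with a hypothetical partitioned/ path). partitionBy writes one directory per favorite_color value, so a filter on that column is answered by partition pruning, reading only the matching directory:

```scala
// No sort or shuffle needed: each task writes its rows into
// per-color subdirectories such as .../favorite_color=blue/.
peopleDf.write.partitionBy("favorite_color").parquet("s3a://some-bucket/partitioned/")

// Reading back: only the favorite_color=blue directory is scanned.
val pBlue3 = spark.read.parquet("s3a://some-bucket/partitioned/")
  .filter($"favorite_color" === "blue")
```

The usual caveat is cardinality: partitionBy works well for a column with a handful of distinct values like favorite_color, but a high-cardinality column produces many tiny files, which is where sorting tends to be the better tool.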