
Spark performance enhancements by storing sorted Parquet files

Will data extracts run quicker if a DataFrame is sorted before being persisted as Parquet files?

Suppose we have the following peopleDf DataFrame (pretend this is a sample and the real one has 20 billion rows):

+-----+----------------+
| age | favorite_color |
+-----+----------------+
|  54 | blue           |
|  10 | black          |
|  13 | blue           |
|  19 | red            |
|  89 | blue           |
+-----+----------------+

Let's write out sorted and unsorted versions of this DataFrame to Parquet files.

peopleDf.write.parquet("s3a://some-bucket/unsorted/")
peopleDf.sort($"favorite_color").write.parquet("s3a://some-bucket/sorted/")

Are there any performance gains when reading in the sorted data and doing a data extract based on favorite_color?

val pBlue1 = spark.read.parquet("s3a://some-bucket/unsorted/").filter($"favorite_color" === "blue")

// is this faster?

val pBlue2 = spark.read.parquet("s3a://some-bucket/sorted/").filter($"favorite_color" === "blue")
asked Nov 14 '16 by Powers

1 Answer

Sorting provides a number of benefits:

  • more efficient filtering using file metadata: Parquet stores min/max statistics per row group, so a pushed-down filter can skip row groups whose value range excludes the predicate (see the sketch after this list).
  • a better compression ratio: sorting clusters similar values together, which dictionary and run-length encoding exploit.
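As a rough sketch of how the metadata benefit plays out (the bucket paths are illustrative, and sortWithinPartitions is used on the assumption that a per-partition sort gives tight row-group statistics without the full shuffle that sort requires):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// rebuild the question's sample data so the snippet is self-contained
val peopleDf = Seq(
  (54, "blue"), (10, "black"), (13, "blue"), (19, "red"), (89, "blue")
).toDF("age", "favorite_color")

// a local sort per partition keeps each file's row groups clustered,
// so Parquet's min/max footer statistics become selective
peopleDf
  .sortWithinPartitions($"favorite_color")
  .write
  .parquet("s3a://some-bucket/sorted-within/")

// row groups whose [min, max] range excludes "blue" can be skipped
// when this filter is pushed down to the Parquet reader
val pBlueSorted = spark.read
  .parquet("s3a://some-bucket/sorted-within/")
  .filter($"favorite_color" === "blue")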

If you want to filter on a single column, partitioning on that column can be more efficient and doesn't require a shuffle (a sketch follows the list below), although there are some related issues right now:

  • Spark lists all leaf nodes even in partitioned data
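For illustration, a minimal sketch of the partitioned layout, reusing the question's peopleDf and an illustrative bucket path:

// partitionBy writes one directory per distinct value, e.g.
// s3a://some-bucket/partitioned/favorite_color=blue/
peopleDf
  .write
  .partitionBy("favorite_color")
  .parquet("s3a://some-bucket/partitioned/")

// only the favorite_color=blue/ directory needs to be read; the
// filter is resolved by partition pruning at planning time
val pBluePartitioned = spark.read
  .parquet("s3a://some-bucket/partitioned/")
  .filter($"favorite_color" === "blue")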
answered Oct 08 '22 by user6022341