Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does Spark benefit from `sortBy` in persistent table?

Spark v2.4 no Hive

Spark benefit from bucketBy in a way that it knows the DataFrame has the correct partitioning. What about sortBy?

spark.range(100, numPartitions=1).write.bucketBy(3, 'id').sortBy('id').saveAsTable('df')

# No need to `repartition`.
spark.table('df').repartition(3, 'id').explain()
# == Physical Plan ==
# *(1) FileScan parquet default.df2[id#33620L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>, # SelectedBucketsCount: 3 out of 3

# Still need to `sortWithinPartitions`.
spark.table('df').sortWithinPartitions('id').explain()
# == Physical Plan ==
# *(1) Sort [id#33620L ASC NULLS FIRST], false, 0
# +- *(1) FileScan parquet default.df2[id#33620L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 3 out of 3

So additional repartition is omitted. However, sortWithinPartitions is not. Is sortBy useful? Can we use sortBy to speed up table joining at all?

like image 804
colinfang Avatar asked Mar 07 '26 12:03

colinfang


1 Answers

Short answer : There is no benefits from sortBy in persistent tables (at the moment at least).

Longer answer :

Spark and Hive do not implement the same semantics or the operational specifications when it comes to bucketing support, although Spark can save bucketed DataFrame into a Hive table.

First, the units of bucketing are different between both frameworks: single bucket file (hive) vs. collection of files per bucket (spark).

Second,

In Hive, each bucket is sorted globally which can optimize queries reading data.

In Spark and till this issue https://issues.apache.org/jira/browse/SPARK-19256 gets (hopefully) resolved, each file is individually sorted but the bucket as a whole is not globally sorted.

Thus, since sorting is not global, there is no benefits form sortBy.

I hope this answers your question.

like image 108
eliasah Avatar answered Mar 09 '26 13:03

eliasah



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!