Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to enable storage partitioned join in spark/iceberg?

How do I use the storage partitioned join feature in Spark 3.3.0? I've tried it out, and my query plan still shows the expensive ColumnarToRow and Exchange steps. My setup is as follows:

  • joining two Iceberg tables, both partitioned on hours(ts), bucket(20, id)
  • join attempted on a.id = b.id AND a.ts = b.ts and on a.id = b.id
  • tables are large, 100+ partitions used, 100+ GB of data to join
  • spark: 3.3.0
  • iceberg: org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1
  • set my spark session config with spark.sql.sources.v2.bucketing.enabled=true

I read through all the docs I could find on the storage partitioned join feature:

  • tracker
  • SPIP
  • PR
  • Youtube demo

I'm wondering if there are other things I need to configure, if there needs to be something implemented in Iceberg still, or if I've set up something wrong. I'm super excited about this feature. It could really speed up some of our large joins.

like image 656
James D Avatar asked Sep 03 '25 09:09

James D


1 Answers

Support for storage-partitioned joins (SPJ) has been added to Iceberg in PR #6371 and released in Apache Iceberg 1.2.0 on March 20th, 2023.

Spark added support for SPJ for v2 sources only in 3.3, so earlier versions can't benefit from this feature.

like image 61
Anton Okolnychyi Avatar answered Sep 05 '25 01:09

Anton Okolnychyi