Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does Spark SQL include a table streaming optimization for joins?

Does Spark SQL include a table streaming optimization for joins and, if so, how does it decide which table to stream?

When doing joins, Hive assumes the last table is the largest one. As a join optimization, it will attempt to buffer the smaller join tables and stream the last one through. If the last table in the join list is not the largest one, Hive has the /*+ STREAMTABLE(tbl) */ hint which tells it the table that should be streamed. As of v1.4.1, Spark SQL does not support the STREAMTABLE hint.

This question has been asked for normal RDD processing, outside of Spark SQL, here. The answer does not apply to Spark SQL where the developer has no control of explicit cache operations.

like image 969
Sim Avatar asked Sep 27 '22 23:09

Sim


1 Answers

I have looked for an answer to this question some time ago and all I could come up with was setting a spark.sql.autoBroadcastJoinThreshold parameter, which is by default 10 MB. It will then attempt to automatically broadcast all the tables with size smaller than the limit set by you. Join order plays no role here for this setting.

If you are interestend in further improving join performance, I highly recommend this presentation.

like image 86
TheMP Avatar answered Sep 30 '22 13:09

TheMP