 

How reliable is spark stream join with static databricks delta table

In Databricks there is a cool feature that allows joining a streaming DataFrame with a Delta table. The cool part is that changes in the Delta table are still reflected in subsequent join results. It works just fine, but I'm curious how this works and what the limitations are. E.g., what's the expected update delay? How does it change as the Delta table grows? Is it safe to rely on it in production?

Andrii Black asked Jan 26 '26 09:01

1 Answer

Yes, you can rely on this feature (it's really a Spark feature) - many customers use it in production. Regarding the other questions, there are multiple aspects here, depending on factors such as how often the table is updated:

  • Because the static Delta table isn't cached, it's re-read on each join. Depending on the cluster configuration this may not be very costly if you use Delta caching, since files aren't re-downloaded every time; only new data is downloaded.
  • Read performance can suffer if the table has a lot of small files, etc. This depends on how you write into the table and whether you run things like OPTIMIZE.
  • Depending on how often the Delta table is updated, you can cache it and refresh it periodically.

But to answer this completely, you'd need to provide more information specific to your code, use case, etc.

Alex Ott answered Jan 29 '26 13:01


