
What happens if a Spark broadcast join is too large?

Tags:

apache-spark

In doing Spark performance tuning, I've found (unsurprisingly) that using broadcast joins eliminates shuffles and improves performance. I've been experimenting with broadcasting on larger joins, and I've been able to successfully use far larger broadcast joins than I expected -- e.g. broadcasting a 2GB compressed (and much larger uncompressed) dataset, running on a 60-node cluster with 30GB of memory per node.
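For reference, here is a minimal sketch of the kind of broadcast join described above. The table paths, column names, and app name are made up for illustration; only the broadcast() hint and join pattern are the point:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

    val factDf = spark.read.parquet("/data/fact_events")    // large table, stays partitioned
    val dimDf  = spark.read.parquet("/data/dim_customers")  // smaller table to broadcast

    // The broadcast() hint ships dimDf in full to every executor, so the join
    // runs map-side and factDf is never shuffled.
    val joined = factDf.join(broadcast(dimDf), Seq("customer_id"))

    joined.write.mode("overwrite").parquet("/data/joined_output")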

However, I have concerns about putting this into production, as the size of our data fluctuates, and I'm wondering what will happen if the broadcast becomes "too large". I'm imagining two scenarios:

A) Data is too big to fit in memory, so some of it gets written to disk, and performance degrades slightly. This would be okay. Or,

B) Data is too big to fit in memory, so it throws an OutOfMemoryError and crashes the whole application. Not so okay.

So my question is: What happens when a Spark broadcast join is too large?

Asked Oct 15 '25 by Jason Evans

1 Answer

Broadcast variables are plain local objects, and apart from distribution and serialization they behave like any other object you use. If they don't fit into memory, you'll get an OOM. Other than memory paging, there is no magic that can prevent that.

So broadcasting is not applicable for objects that may not fit into memory (and that must also leave plenty of free memory for standard Spark operations).
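One way to hedge against fluctuating data sizes is to drop the explicit broadcast() hint and rely on spark.sql.autoBroadcastJoinThreshold instead: the optimizer then broadcasts the smaller side only when its estimated size is under the threshold, and otherwise falls back to a shuffle-based join (typically sort-merge) rather than risking an OOM. A sketch of that approach is below; the 1 GB threshold, paths, and column name are illustrative assumptions, and the size estimate depends on the statistics Spark has for the table:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("auto-broadcast-sketch").getOrCreate()

    // Tables whose estimated size is below this threshold are broadcast
    // automatically; anything larger falls back to a shuffle-based join.
    // 1 GB here is an assumption, not a recommendation.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1024L * 1024 * 1024)

    val factDf = spark.read.parquet("/data/fact_events")
    val dimDf  = spark.read.parquet("/data/dim_customers")

    // No broadcast() hint: Spark broadcasts dimDf only if it looks small enough.
    val joined = factDf.join(dimDf, Seq("customer_id"))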

Answered Oct 18 '25 by user7891025


