 

How does Spark handle data larger than cluster memory?

Tags:

apache-spark

If I have only one executor with 25 GB of memory, and it can run only one task at a time, is it possible to process (transformation and action) 1 TB of data? If yes, how will it be read, and where will the intermediate data be stored?

Also, for the same scenario: if the Hadoop file has 300 input splits, there will be 300 partitions in the RDD, so where will those partitions be? Will they remain on the Hadoop disks only, and will my single task run 300 times?

asked Jul 07 '17 by Rahul


1 Answer

I found a good answer on the Hortonworks website.

Contrary to popular belief, Spark is not in-memory only.

a) Simple read, no shuffle (no joins, ...)

For the initial reads, Spark, like MapReduce, reads the data in a stream and processes it as it comes along. That is, unless there is a reason to, Spark will NOT materialize the full RDDs in memory (you can tell it to do so, however, if you want to cache a small dataset). An RDD is resilient because Spark knows how to recreate it (re-read a block from HDFS, for example), not because it is stored in memory in different locations (though that can be done too).

So if you filter out most of your data, or do an efficient aggregation that aggregates on the map side, you will never have the full table in memory.
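As a minimal sketch of point a): the read below is lazy, and each task streams through one HDFS block at a time, so the full dataset is never materialized (the path and app name are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object StreamingRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("filter-no-cache"))

        // Lazy: nothing is read yet. Each task later streams through one
        // HDFS block; the full 1 TB is never held in memory at once.
        val lines = sc.textFile("hdfs:///data/events-1tb.txt") // hypothetical path

        // The filter discards most rows as they stream past.
        val errors = lines.filter(_.contains("ERROR"))

        // Cache only because the filtered result is assumed to be small.
        errors.cache()

        println(s"error lines: ${errors.count()}") // the action triggers the read
        sc.stop()
      }
    }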

b) Shuffle

This is done very similarly to MapReduce, as it writes the map outputs to disk and the reducers read them over HTTP. However, Spark uses an aggressive filesystem buffer strategy on the Linux filesystem, so if the OS has memory available, the data will not actually be written to physical disk.
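A hedged sketch of point b): reduceByKey aggregates map-side before shuffling, and the shuffle files land under spark.local.dir (the scratch directory and paths here are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleToDisk {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("word-count-shuffle")
          // Shuffle map outputs are written under this scratch directory;
          // the OS page cache decides when they reach physical disk.
          .set("spark.local.dir", "/mnt/spark-scratch") // hypothetical mount

        val sc = new SparkContext(conf)

        // reduceByKey combines values map-side first, so only partial sums
        // per key are shuffled; reducers then fetch them over the network.
        val counts = sc.textFile("hdfs:///data/events-1tb.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///out/word-counts")
        sc.stop()
      }
    }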

c) After Shuffle

RDDs after a shuffle are normally cached by the engine (otherwise a failed node or RDD would require a complete re-run of the job); however, as abdelkrim mentions, Spark can spill these to disk unless you overrule that.
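To control point c) explicitly, you can pick a storage level that is allowed to spill; a minimal sketch (paths hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object PersistAfterShuffle {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("persist-demo"))

        val counts = sc.textFile("hdfs:///data/events-1tb.txt") // hypothetical path
          .map(line => (line.split(",")(0), 1L))
          .reduceByKey(_ + _)

        // MEMORY_AND_DISK keeps partitions in memory while they fit and
        // spills the rest to disk, so a lost partition is re-read from
        // disk instead of recomputed from the start of the lineage.
        counts.persist(StorageLevel.MEMORY_AND_DISK)

        counts.count()                   // first action computes and persists
        counts.take(10).foreach(println) // reuses the persisted partitions
        sc.stop()
      }
    }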

d) Spark Streaming

This is a bit different. Spark Streaming expects all data to fit in memory unless you override the settings.
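For point d), the usual knobs are the rate-limiting settings; a sketch assuming a simple socket source (the host, port, and rate values are hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BoundedStreaming {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("bounded-streaming")
          // Adapt the ingestion rate to what the cluster can process,
          // instead of letting unbounded input pile up in memory.
          .set("spark.streaming.backpressure.enabled", "true")
          // Hard cap on records per second per receiver (hypothetical value).
          .set("spark.streaming.receiver.maxRate", "10000")

        val ssc = new StreamingContext(conf, Seconds(10))
        val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
        lines.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }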

Here is the original page.

Matei Zaharia's original Spark design dissertation also helps (see section 2.6.4, Behavior with Insufficient Memory).

Hope this is useful.

answered Nov 15 '22 by neilron