 

Spark master memory requirements related to data size

Tags:

apache-spark

Are Spark master memory requirements related to the size of the processed data?

The Spark driver and Spark workers/executors deal with processed data directly (and execute application code), so their memory needs can be linked to the size of the processed data. But is the Spark master in any way affected by the data size? It seems to me that it isn't, because it just manages the Spark workers and doesn't work with the data itself directly.

Markus Miller asked Mar 07 '17


People also ask

How much memory do I need for Spark?

Memory. In general, Spark can run well with anywhere from 8 GiB to hundreds of gigabytes of memory per machine. In all cases, we recommend allocating at most 75% of the memory to Spark; leave the rest for the operating system and buffer cache.

How large data can Spark handle?

How large a cluster can Spark scale to? Many organizations run Spark on clusters of thousands of nodes. The largest cluster we know of has 8000 nodes. In terms of data size, Spark has been shown to work well up to petabytes.

What happens if data do not fit in-memory in Spark?

Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
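As an illustration of the spill behavior described above, here is a minimal PySpark sketch; the dataset path and the choice of storage level are assumptions, not part of the original page:

    # Partitions that do not fit in memory are spilled to disk with
    # MEMORY_AND_DISK; with MEMORY_ONLY they would be recomputed instead.
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("storage-level-demo").getOrCreate()

    df = spark.read.parquet("/data/events")   # hypothetical input dataset
    df.persist(StorageLevel.MEMORY_AND_DISK)  # the default used by DataFrame.cache()
    df.count()                                # an action, to materialize the cache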

How is Spark executor memory determined?

According to the recommendations discussed above (the figures assume a 10-node cluster with 64 GB of RAM per node, 150 usable cores in total, and 5 cores per executor): number of available executors = total cores / cores per executor = 150/5 = 30. Leaving 1 executor for the ApplicationMaster gives --num-executors = 29. Number of executors per node = 30/10 = 3. Memory per executor = 64 GB / 3 ≈ 21 GB.
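As a rough sketch of how such figures could be applied when building a session (the exact values are cluster-specific assumptions, and the 19g heap leaves room for the off-heap executor memory overhead on top of the ~21 GB budget):

    # Hypothetical session configuration based on the sizing example above.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("executor-sizing-demo")
        .config("spark.executor.instances", "29")  # 30 executors minus 1 for the ApplicationMaster
        .config("spark.executor.cores", "5")
        .config("spark.executor.memory", "19g")    # ~21 GB per executor minus memory overhead
        .getOrCreate()
    )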


1 Answer

Spark's main data entities, like DataFrames or Datasets, are built on RDDs (Resilient Distributed Datasets). They are distributed, which means the processing generally takes place in the executors.

Some RDD actions will end with data on the driver process, though; most notably collect and the actions that use it (like show, take, or toPandas if you are using Python). collect, as the name implies, will collect some or all of the rows of the distributed dataset and materialize them in the driver process. At this point, yes, you will need to take the memory footprint of your data into account.
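To make this concrete, a small sketch (assuming an existing DataFrame df) of the actions that move rows to the driver:

    rows = df.collect()    # materializes every row in driver memory
    sample = df.take(10)   # ships only 10 rows to the driver
    df.show(20)            # fetches just the rows it prints
    pdf = df.toPandas()    # like collect(), but builds a pandas DataFrame on the driver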

This is why you will generally want to reduce the data you collect as much as possible. You can apply groupBy, filter, and many other transformations so that, if you do need to process the data in the driver, it is as refined as possible.
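A sketch of that pattern, with made-up column names: the filtering and aggregation run on the executors, and only the small summary is collected.

    from pyspark.sql import functions as F

    summary = (
        df.filter(F.col("status") == "error")
          .groupBy("service")
          .agg(F.count("*").alias("error_count"))
          .collect()  # only one row per service reaches the driver
    )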

Manu Valdés answered Oct 01 '22