 

Spark Task Memory allocation

I am trying to find out the best way to configure the memory on the nodes of my cluster. However, I believe there are some things I need to understand better first, such as how Spark handles memory across tasks.

For example, let's say I have 3 executors, and each executor can run up to 8 tasks in parallel (i.e. 8 cores). If I have an RDD with 24 partitions, then theoretically all partitions can be processed in parallel. However, if we zoom into one executor, this assumes that each task can hold its partition in memory while operating on it. If not, it means that 8 tasks will not actually run in parallel and some scheduling will be required.
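For concreteness, the kind of setup I have in mind would look roughly like this (the executor memory value and the input path are just placeholders; sizing the memory is exactly the part I am unsure about):

    import org.apache.spark.{SparkConf, SparkContext}

    // 3 executors with 8 cores each; the memory value is a placeholder.
    val conf = new SparkConf()
      .setAppName("memory-sizing-example")
      .set("spark.executor.instances", "3")
      .set("spark.executor.cores", "8")
      .set("spark.executor.memory", "16g") // placeholder

    val sc  = new SparkContext(conf)
    val rdd = sc.textFile("hdfs:///path/to/input") // hypothetical path
                .repartition(24)                   // 24 partitions = 3 executors * 8 cores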

Hence I concluded that if one seeks true parallelism, having some idea of the size of your partitions would help, since it would tell you how to size your executors.

  • Q0 - I simply want to understand a little better what happens when not all the partitions can fit in memory within one executor. Are some spilled to disk while others are operated on in memory? Does Spark reserve memory for each task and, if it detects that there is not enough, schedule the tasks accordingly? Or does one simply run into an out-of-memory error?

  • Q1 - Does true parallelism within an executor also depend on the amount of memory available on the executor? In other words, I may have 8 cores in my cluster, but if I do not have enough memory to load 8 partitions of my data at once, then I won't be fully parallel.

As a final note, I have seen the following statement several times and find it a little confusing:

“Increasing the number of partitions can also help reduce out-of-memory errors, since it means that Spark will operate on a smaller subset of the data for each executor.”

How does that work exactly? I mean, Spark may work on a smaller subset, but if the total set of partitions can't fit in memory anyway, what happens?

asked Aug 07 '17 by MaatDeamon


1 Answer

Why should I increase number of tasks (partitions)?

I would like to first address the last point, the statement that is confusing you. Here is a quote from another question:

Spark does not need to load everything in memory to be able to process it. This is because Spark will partition the data into smaller blocks and operate on these separately.

In fact, by default Spark tries to split input data automatically into some optimal number of partitions:

Spark automatically sets the number of “map” tasks to run on each file according to its size

One can specify the number of partitions for the operation being performed (e.g. for cogroup: def cogroup[W](other: RDD[(K, W)], numPartitions: Int)), and one can also call .repartition() after any RDD transformation.
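A minimal sketch (the RDDs and partition counts are made up for illustration, and `sc` is assumed to be an existing SparkContext):

    // Hypothetical pair RDDs, for illustration only.
    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", 3)))

    // Ask cogroup itself to produce 48 partitions...
    val grouped = left.cogroup(right, 48)

    // ...or change the partitioning of any RDD afterwards.
    val regrouped = grouped.repartition(48)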

Moreover, later in the same paragraph of the documentation they say:

In general, we recommend 2-3 tasks per CPU core in your cluster.

In summary:

  1. the default number of partitions is a good start;
  2. 2-3 partitions per CPU core is generally recommended (see the sketch right after this list).
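For example, a common way to land in that range (a sketch, assuming `sc` is your SparkContext and `rdd` your input RDD):

    // defaultParallelism usually equals the total number of executor cores.
    val totalCores = sc.defaultParallelism
    val target     = totalCores * 3          // upper end of the 2-3x recommendation
    val tuned      = rdd.repartition(target)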

How does Spark deal with inputs that do not fit in memory?

In short, by partitioning the input and intermediate results (RDDs). Usually each small chunk fits in the memory available to the executor and is processed quickly.

Spark is capable of caching the RDDs it has computed. By default, every time an RDD is reused it is recomputed (it is not cached); calling .cache() or .persist() keeps the already-computed result in memory or on disk.
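For example (a sketch; `parse` and `isValid` stand in for whatever your pipeline actually does):

    import org.apache.spark.storage.StorageLevel

    // Without persisting, `expensive` would be recomputed by both actions below.
    val expensive = rdd.map(parse).filter(isValid)

    expensive.persist(StorageLevel.MEMORY_AND_DISK) // keep in memory, spill to disk if needed
    // expensive.cache()                            // shorthand for persist(MEMORY_ONLY)

    val total = expensive.count()
    val first = expensive.take(10)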

Internally, each executor has a memory pool that floats between execution and storage (see here for more details). When there is not enough memory for a task's execution, Spark first tries to evict some of the storage cache, and then spills task data to disk. See these slides for further details. The balance between execution and storage memory is well described in this blog post, which also has a nice illustration:

Spark memory allocation
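For reference, the split between the two pools is controlled by two settings. The values below are the defaults of the unified memory manager (Spark 1.6+), shown only as a starting point, not a recommendation:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Fraction of (heap - 300 MB reserved) shared by execution and storage.
      .set("spark.memory.fraction", "0.6")
      // Fraction of that pool that storage keeps protected from eviction by execution.
      .set("spark.memory.storageFraction", "0.5")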

OutOfMemory often happens not directly because of large input data, but because of poor partitioning and hence large auxiliary data structures, such as the hash maps on the reducer side (here the documentation again advises having more partitions than executors). So no, OutOfMemory will not happen just because the input is big; processing may be very slow, though (since it will have to write to and read from disk). The documentation also notes that tasks as short as 200 ms (in running time) are fine for Spark.
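As an illustration (the RDD name and the partition count are arbitrary), asking the shuffle for more output partitions makes each reducer-side hash map correspondingly smaller:

    // `pairs` is a hypothetical RDD[(K, V)]; 400 output partitions means each
    // reduce task aggregates a much smaller slice of the key space.
    val counts = pairs.reduceByKey(_ + _, 400)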

Outline: split your data properly: more than 1 partition per core, and each task should run for more than ~200 ms. The default partitioning is a good starting point; tweak the parameters manually from there.

(I would suggest using a 1/8 subset of the input data on a 1/8-sized cluster to find the optimal number of partitions.)
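That could be as simple as (the fraction and seed are illustrative):

    // Roughly 1/8 of the data, sampled without replacement; fixed seed for repeatability.
    val subset = rdd.sample(withReplacement = false, fraction = 0.125, seed = 42L)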

Do tasks within same executor affect each other?

Short answer: they do. For more details, check out the slides I mentioned above (starting from slide #32).

All N tasks running on an executor share its memory pool, so each gets roughly a 1/N share, and they therefore affect each other's "parallelism". If I interpret your idea of true parallelism correctly, it means "full utilization of CPU resources". In that case, yes: a small memory pool will result in data spilling to disk and the computation becoming IO-bound (instead of CPU-bound).
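A rough back-of-the-envelope, assuming an executor with a 16 GB heap, 8 task slots, and default memory settings (all of these numbers are assumptions for illustration only):

    // Unified pool ≈ (heap - 300 MB reserved) * spark.memory.fraction (0.6 by default).
    val heapMb = 16 * 1024.0
    val poolMb = (heapMb - 300) * 0.6       // ≈ 9650 MB shared by execution and storage
    val slots  = 8
    // With all 8 tasks active, each task's execution memory floats between roughly
    // 1/(2N) and 1/N of the pool (the upper bound assumes storage has been evicted):
    val perTaskMinMb = poolMb / (2 * slots) // ≈ 600 MB
    val perTaskMaxMb = poolMb / slots       // ≈ 1200 MB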

Further reading

I would highly recommend the entire Tuning Spark chapter, and the Spark Programming Guide in general. See also this blog post on Spark Memory Management by Alexey Grishchenko.

answered Oct 02 '22 by Nikolay Vasiliev