how to limit the number of concurrent map tasks per executor?

A map operation in my Spark app takes an RDD[A] as input and maps each element of RDD[A] to an object of type B using a custom mapping function func(x: A): B. Because func() requires a significant amount of memory when computing each input x, I want to limit the number of concurrent map tasks per executor so that the total amount of memory required by all tasks running on the same executor does not exceed the physical memory available on the node.

I checked the available Spark configuration options, but I am not sure which one to use. Does using coalesce(numPartitions) to set the number of partitions of RDD[A] fulfil the purpose?
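For reference, a minimal sketch of the situation described above; the element types A and B, the mapping function body, the input path and the partition count are placeholders, not taken from any real code base:

import org.apache.spark.rdd.RDD

// Hypothetical element types and a memory-heavy mapping function (placeholders).
case class A(payload: Array[Byte])
case class B(result: Double)

def func(x: A): B = {
  // ... allocates large intermediate buffers while processing x ...
  B(x.payload.length.toDouble)
}

// `sc` is an existing SparkContext; the input path is a placeholder.
val input: RDD[A] = sc.objectFile[A]("hdfs:///path/to/input")

// coalesce() only changes how many partitions (and hence tasks) exist in total;
// it does not cap how many of those tasks run concurrently on one executor.
val output: RDD[B] = input.coalesce(100).map(func)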

asked Jan 02 '15 by PC Yin


People also ask

How do I control the number of tasks in Spark?

Number of tasks executing in parallel: let's say you have 5 executors available for your application, and each executor is assigned 10 CPU cores. 5 executors × 10 CPU cores per executor = 50 CPU cores available in total. With this setup, Spark can execute a maximum of 50 tasks in parallel at any given time.
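A hedged sketch of that arithmetic in code; the application name is a placeholder and the resource numbers come from the example above, not from the question:

import org.apache.spark.{SparkConf, SparkContext}

// 5 executors x 10 cores per executor = up to 50 tasks running in parallel cluster-wide.
val conf = new SparkConf()
  .setAppName("parallelism-example")      // placeholder name
  .set("spark.executor.instances", "5")   // 5 executors
  .set("spark.executor.cores", "10")      // 10 cores per executor
val sc = new SparkContext(conf)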

How many tasks does an executor have?

--executor-cores 5 means that each executor can run a maximum of five tasks at the same time. The memory property impacts the amount of data Spark can cache, as well as the maximum sizes of the shuffle data structures used for grouping, aggregations, and joins. The --num-executors command-line flag (or the spark.executor.instances configuration property) controls the number of executors requested.
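The flags named above, shown together on a spark-submit command line as an illustration only; the memory size and the JAR name are placeholders:

spark-submit \
  --num-executors 5 \
  --executor-cores 5 \
  --executor-memory 8g \
  your-app.jar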

How many executors should each job use?

Tiny approach – allocating one executor per core. This does not leave enough memory overhead for YARN or for accumulated cached variables (broadcasts and accumulators), and it gives up the benefit of running multiple tasks in the same JVM. Fat approach – allocating one executor per node.
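As an illustration of a middle ground between those two extremes (the 16-core / 64 GB node size is an assumption, not from the question): reserve roughly one core and one gigabyte per node for the OS and Hadoop daemons, then split the rest into a few mid-sized executors.

import org.apache.spark.SparkConf

// Assumed node size: 16 cores, 64 GB RAM (illustrative only).
// Leave ~1 core and ~1 GB per node for the OS and Hadoop/YARN daemons,
// then run 3 executors of 5 cores each on that node.
val balancedConf = new SparkConf()
  .set("spark.executor.cores", "5")
  .set("spark.executor.memory", "19g")    // ~ (64 GB - 1 GB) / 3 executors, leaving room for overhead
  .set("spark.executor.instances", "3")   // per-node figure here; scale by the number of nodes in practice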


1 Answer

The number of concurrent tasks per executor is determined by the number of available cores, not by the total number of tasks, so changing the parallelism level with coalesce or repartition will not help constrain the memory used by each task; it only changes the amount of data in each partition that a given task has to process (*).

As far as I know, there's no way to constrain the memory used by a single task, because it's sharing the resources of the worker JVM, and hence sharing memory with the other tasks on the same executor.

Assuming a fair share per task, a guideline for the amount of memory available per task (core) will be:

spark.executor.memory * spark.storage.memoryFraction / #cores-per-executor
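To make the guideline concrete, a worked example with purely illustrative numbers; the 8 GB executor size, the 4 cores per executor and the legacy default of spark.storage.memoryFraction = 0.6 are assumptions, not values from the question:

spark.executor.memory        = 8g
spark.storage.memoryFraction = 0.6
cores per executor           = 4
memory per task              ≈ 8 GB * 0.6 / 4 ≈ 1.2 GB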

A likely way to force fewer tasks per executor, and hence make more memory available per task, is to assign more cores to each task using spark.task.cpus (default = 1).
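For instance, a minimal sketch assuming executors with 8 cores (the core count is an assumption): setting spark.task.cpus to 2 caps each executor at 8 / 2 = 4 concurrent tasks, each with twice the fair share of memory.

import org.apache.spark.SparkConf

val taskConf = new SparkConf()
  .set("spark.executor.cores", "8")   // cores per executor (assumed for illustration)
  .set("spark.task.cpus", "2")        // each task claims 2 cores => at most 4 concurrent tasks per executor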

(*) Given that the concern here is at the level of each element x of an RDD, the only possible setting that could affect memory usage is to set a parallelism level less than the number of CPUs of a single executor, but that would result in severe under-utilization of the cluster resources as all workers but one will be idle.
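A minimal sketch of that (generally discouraged) option, reusing the placeholder input and func from the sketch under the question and assuming each executor has more than 8 cores:

// With only 8 partitions there are only 8 tasks in this stage, so most cores
// in the cluster sit idle for its entire duration.
val throttled: RDD[B] = input.coalesce(8).map(func)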

answered Nov 15 '22 by maasg