 

Number of CPUs per Task in Spark

I don't quite understand the spark.task.cpus parameter. It seems to me that a "task" corresponds to a "thread" or a "process", if you will, within the executor. Suppose that I set spark.task.cpus to 2.

  1. How can a thread utilize two CPUs simultaneously? Couldn't it require locks and cause synchronization problems?

  2. I'm looking at the launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of CPUs per task" there. So where/how does Spark eventually allocate more than one CPU to a task in standalone mode?

asked Apr 17 '16 at 01:04 by smz


People also ask

How does Spark calculate number of tasks?

Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc.), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD's number of partitions.
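For illustration, here is a minimal sketch of those two knobs, assuming an existing cluster; the input path is hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("task-count-demo"))

// "Map" stage: one task per input partition; the second argument is a
// lower bound on the number of partitions (hence map tasks) for this file.
val lines = sc.textFile("hdfs:///data/input.txt", 8)

// "Reduce" stage: by default it would inherit the largest parent RDD's
// partition count; here an explicit partition count of 4 is passed instead.
val counts = lines.flatMap(_.split("\\s+"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _, 4)

println(counts.getNumPartitions)  // 4, i.e. 4 reduce tasks for this stage
```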

What is number of cores in Spark?

The consensus in most Spark tuning guides is that 5 cores per executor is the optimum number of cores in terms of parallel processing.
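As a rough sketch of how that guideline might be applied in code (the app name and memory value are illustrative, not taken from the answer):

```scala
import org.apache.spark.SparkConf

// Follow the common "5 cores per executor" tuning guideline.
val conf = new SparkConf()
  .setAppName("five-cores-per-executor")
  .set("spark.executor.cores", "5")    // cores per executor
  .set("spark.executor.memory", "8g")  // example value; size to your workload
```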

How many partitions does a single task work on Spark?

Partitions in Spark do not span multiple machines. Tuples in the same partition are guaranteed to be on the same machine. Spark assigns one task per partition and each worker can process one task at a time.

Does Spark use multiple cores?

Spark does not require users to have high-end, expensive systems with great computing power. It splits the big data across the multiple cores or systems available in the cluster and optimally utilizes these computing resources to process the data in a distributed manner.


1 Answer

To the best of my knowledge, spark.task.cpus controls the parallelism of tasks in your cluster in the case where some particular tasks are known to have their own internal (custom) parallelism.

In more detail: We know that spark.cores.max defines how many threads (aka cores) your application needs. If you leave spark.task.cpus = 1, then you will have spark.cores.max concurrent Spark tasks running at the same time.

You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your tasks spawns two threads, interacts with external tools, etc.). By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max = 10 and spark.task.cpus = 2, Spark will only create 10/2 = 5 concurrent tasks. Given that your tasks need (say) 2 threads internally, the total number of executing threads will never be more than 10. This means that you never go above your initial contract (defined by spark.cores.max).
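A hedged sketch of that configuration in standalone mode (the master URL and app name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative standalone-mode settings; master URL and app name are placeholders.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("multi-threaded-tasks")
  .set("spark.cores.max", "10")  // total cores the application may claim
  .set("spark.task.cpus", "2")   // cores the scheduler reserves per task

val sc = new SparkContext(conf)

// With these settings the scheduler runs at most 10 / 2 = 5 tasks concurrently,
// so each task can spawn (say) 2 threads of its own without oversubscribing
// the 10 cores the application was granted.
```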

answered Oct 07 '22 at 18:10 by marios