 

Number of CPUs per Task in Spark

I don't quite understand the spark.task.cpus parameter. It seems to me that a "task" corresponds to a "thread" or a "process", if you will, within the executor. Suppose that I set spark.task.cpus to 2.

  1. How can a thread utilize two CPUs simultaneously? Couldn't it require locks and cause synchronization problems?

  2. I'm looking at the launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of CPUs per task" there. So where/how does Spark eventually allocate more than one CPU to a task in standalone mode?

asked Apr 17 '16 at 01:04 by smz


People also ask

How does Spark calculate number of tasks?

Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc.), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD's number of partitions.
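For illustration, here is a minimal sketch of those two knobs, assuming an existing cluster; the input path is hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("task-count-demo"))

// "Map" stage: one task per input partition; the second argument is a
// lower bound on the number of partitions (hence map tasks) for this file.
val lines = sc.textFile("hdfs:///data/input.txt", 8)

// "Reduce" stage: by default it would inherit the largest parent RDD's
// partition count; here an explicit partition count of 4 is passed instead.
val counts = lines.flatMap(_.split("\\s+"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _, 4)

println(counts.getNumPartitions)  // 4, i.e. 4 reduce tasks for this stage
```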

What is number of cores in Spark?

The consensus in most Spark tuning guides is that 5 cores per executor is the optimum number of cores in terms of parallel processing.
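As a rough sketch of how that guideline might be applied in code (the app name and memory value are illustrative, not taken from the answer):

```scala
import org.apache.spark.SparkConf

// Follow the common "5 cores per executor" tuning guideline.
val conf = new SparkConf()
  .setAppName("five-cores-per-executor")
  .set("spark.executor.cores", "5")    // cores per executor
  .set("spark.executor.memory", "8g")  // example value; size to your workload
```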

How many partitions does a single task work on Spark?

Partitions in Spark do not span multiple machines. Tuples in the same partition are guaranteed to be on the same machine. Spark assigns one task per partition and each worker can process one task at a time.

Does Spark use multiple cores?

Spark does not require users to have high-end, expensive systems with great computing power. It splits the big data across the multiple cores or systems available in the cluster and optimally utilizes these computing resources to process the data in a distributed manner.


1 Answer

To the best of my knowledge, spark.task.cpus controls the parallelism of tasks in your cluster in the case where some particular tasks are known to have their own internal (custom) parallelism.

In more detail: We know that spark.cores.max defines how many threads (aka cores) your application needs. If you leave spark.task.cpus = 1, then you will have spark.cores.max concurrent Spark tasks running at the same time.

You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your tasks spawns two threads, interacts with external tools, etc.). By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max = 10 and spark.task.cpus = 2, Spark will only create 10/2 = 5 concurrent tasks. Given that your tasks need (say) 2 threads internally, the total number of executing threads will never be more than 10. This means that you never go above your initial contract (defined by spark.cores.max).
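A hedged sketch of that configuration in standalone mode (the master URL and app name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative standalone-mode settings; master URL and app name are placeholders.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("multi-threaded-tasks")
  .set("spark.cores.max", "10")  // total cores the application may claim
  .set("spark.task.cpus", "2")   // cores the scheduler reserves per task

val sc = new SparkContext(conf)

// With these settings the scheduler runs at most 10 / 2 = 5 tasks concurrently,
// so each task can spawn (say) 2 threads of its own without oversubscribing
// the 10 cores the application was granted.
```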

answered Oct 07 '22 at 18:10 by marios