How does Spark achieve parallelism within one task on multi-core or hyper-threaded machines

I have been reading and trying to understand how the Spark framework uses its cores in Standalone mode. According to the Spark documentation, the parameter "spark.task.cpus" defaults to 1, and is described as the number of cores to allocate for each task.
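For reference, this is the kind of setup I mean (a minimal sketch; the app name and standalone master URL are just placeholders):

```scala
// Minimal sketch of setting spark.task.cpus in standalone mode.
// The app name and master URL below are placeholders, not real values.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("task-cpus-test")            // placeholder app name
  .setMaster("spark://master-host:7077")   // placeholder standalone master URL
  .set("spark.task.cpus", "4")             // cores to allocate for each task
val sc = new SparkContext(conf)
```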

Question 1: For a multi-core machine (e.g., 4 cores in total, 8 hardware threads), when "spark.task.cpus = 4", will Spark use 4 cores (1 thread per core) or 2 cores with hyper-threading?

What will happen if I set "spark.task.cpus = 16", which is more than the number of available hardware threads on this machine?

Question 2: How is this type of hardware parallelism achieved? I tried to look into the code but couldn't find anything that communicates with the hardware or the JVM for core-level parallelism. For example, if the task is a "filter" function, how is a single filter task split across multiple cores or threads?

Maybe I am missing something. Is this related to the Scala language?

asked Apr 17 '16 by Nodame


1 Answer

To answer your title question, Spark by itself does not give you parallelism gains within a task. The main purpose of the spark.task.cpus parameter is to allow for tasks that are multithreaded by nature. If you call an external multithreaded routine within each task, or you want to encapsulate the finest level of parallelism yourself at the task level, you may want to set spark.task.cpus to more than 1.

  • Setting this parameter to more than 1 is not something you would do often, though.

    • The scheduler will not launch a task if the number of available cores is less than the cores required by the task, so if your executor has 8 cores, and you've set spark.task.cpus to 3, only 2 tasks will get launched.
    • If your task does not consume the full capacity of the cores all the time, you may find that using spark.task.cpus=1 and experiencing some contention within the task still gives you more performance.
    • Overhead from things like GC or I/O probably shouldn't be included in the spark.task.cpus setting, because it'd probably be a much more static cost that doesn't scale linearly with your task count.
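For illustration, here is a minimal sketch of such a multithreaded task: the per-task thread pool is the kind of thing spark.task.cpus > 1 is meant to account for. The pool size, the chunk size, and the existing SparkContext sc are assumptions for the example, not anything Spark requires.

```scala
// A sketch of a "multithreaded task": each Spark task processes its own partition
// with a local thread pool, so reserving extra cores per task (spark.task.cpus)
// keeps the scheduler from oversubscribing the executor.
import java.util.concurrent.{Callable, Executors}
import scala.collection.JavaConverters._

val total = sc.parallelize(1 to 1000000, 8).mapPartitions { iter =>
  val pool = Executors.newFixedThreadPool(4)  // matches spark.task.cpus = 4 in this example
  try {
    val chunks = iter.grouped(10000).map { chunk =>
      new Callable[Long] { def call(): Long = chunk.map(_.toLong).sum }
    }.toList
    // invokeAll blocks until every chunk has been summed on the pool's threads
    pool.invokeAll(chunks.asJava).asScala.map(_.get()).iterator
  } finally {
    pool.shutdown()
  }
}.reduce(_ + _)
```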

Question 1: For a multi-core machine (e.g., 4 cores in total, 8 hardware threads), when "spark.task.cpus = 4", will Spark use 4 cores (1 thread per core) or 2 cores with hyper-threading?

The JVM will almost always rely on the OS to provide it with info and mechanisms to work with CPUs, and AFAIK Spark doesn't do anything special here. If Runtime.getRuntime().availableProcessors() or ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors() return 4 for your dual-core HT-enabled Intel® processor, Spark will also see 4 cores.
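If you want to verify this yourself, a trivial check looks like the following; on an HT-enabled machine both calls report logical processors (hardware threads), not physical cores:

```scala
// What the JVM (and hence Spark's defaults) see on this machine.
import java.lang.management.ManagementFactory

println(Runtime.getRuntime.availableProcessors)
println(ManagementFactory.getOperatingSystemMXBean.getAvailableProcessors)
```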

Question 2: How is this type of hardware parallelism achieved? I tried to look into the code but couldn't find anything that communicates with the hardware or the JVM for core-level parallelism. For example, if the task is a "filter" function, how is a single filter task split across multiple cores or threads?

As mentioned above, Spark won't automatically parallelize a task according to the spark.task.cpus parameter. Spark is mostly a data parallelism engine, and its parallelism is achieved mostly through representing your data as RDDs.
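To make that concrete, here is a minimal sketch (assuming an existing SparkContext sc): a filter is not split up within one task; instead the data is split into partitions, and each partition gets its own single-threaded filter task.

```scala
// Data parallelism, not task parallelism: 8 partitions -> 8 independent filter tasks,
// each running single-threaded on one core of some executor.
val rdd   = sc.parallelize(1 to 1000000, 8)   // 8 partitions
val evens = rdd.filter(_ % 2 == 0)            // narrow transformation; partitioning is preserved
println(evens.partitions.length)              // still 8
println(evens.count())                        // up to 8 tasks run in parallel, one per partition
```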

answered Dec 05 '22 by Dimitar Dimitrov