
Spark: understanding partitioning - cores

I'd like to understand partitioning in Spark. I'm running Spark in local mode on Windows 10. My laptop has 2 physical cores and 4 logical cores.

1/ Terminology: to me, a core in Spark = a thread. So a core in Spark is different from a physical core, right? And a Spark core is associated with a task, right? If so, since you need a thread for each partition, and my Spark SQL DataFrame has 4 partitions, it needs 4 threads, right?

2/ If I have 4 logical cores, does that mean I can only run 4 threads concurrently on my laptop? So at most 4 concurrent tasks in Spark?

3/ Setting the number of partitions: how should I choose the number of partitions of my DataFrame so that further transformations and actions run as fast as possible?
- Should it have 4 partitions, since my laptop has 4 logical cores?
- Is the number of partitions related to physical cores or logical cores?
- The Spark documentation says you need 2-3 tasks per CPU. Since I have two physical cores, should the number of partitions be 4 or 6?

(I know the number of partitions won't have much effect in local mode, but this is just to understand.)
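For reference, here is roughly how I'm experimenting (a minimal sketch; the local[4] master, the app name and the toy DataFrame are just placeholders, not my real job):

```python
from pyspark.sql import SparkSession

# Local mode with 4 threads ("cores"), matching my 4 logical cores.
spark = (SparkSession.builder
         .master("local[4]")
         .appName("partitioning-test")      # placeholder name
         .getOrCreate())

# Default parallelism Spark will use for new DataFrames/RDDs.
print(spark.sparkContext.defaultParallelism)

df = spark.range(1_000_000)                 # toy DataFrame instead of my real data
print(df.rdd.getNumPartitions())            # current number of partitions
```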

asked Oct 29 '22 by Jack Hoe

1 Answer

  1. There's no such thing as a "Spark core". If you are referring to options like --executor-cores, then yes, that refers to how many tasks each executor will run concurrently.

  2. You can set the number of concurrent tasks to whatever you want, but going above the number of logical cores you have probably won't give any advantage.

  3. The number of partitions to use is situational. Without knowing the data or the transformations you are doing, it's hard to give a number. Typical advice is to use a number just below a multiple of your total cores; for example, if you have 16 cores, numbers like 47, 79 or 127, each just under a multiple of 16, are good choices. The reason is that you want to keep all cores working (with as little time as possible spent on idle resources waiting for others to finish), while leaving a little slack for speculative execution (Spark may decide to run the same task twice if it is running slowly, to see if it goes faster on a second try). See the sketch after this list for how these settings map to code.
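As a rough illustration of points 1-3, here is a minimal PySpark sketch for local mode. The master URL, app name and partition counts are arbitrary example values, not recommendations, and spark.sql.shuffle.partitions only affects partitioning after shuffles (joins, groupBy, etc.):

```python
from pyspark.sql import SparkSession

# In local mode, local[N] controls how many tasks run concurrently
# (the closest thing to "cores" on a laptop). On a cluster you would
# use spark-submit's --executor-cores / --num-executors instead.
spark = (SparkSession.builder
         .master("local[4]")                           # 4 concurrent tasks
         .config("spark.sql.shuffle.partitions", 8)    # partitions produced by shuffles
         .appName("partition-tuning-example")          # example name
         .getOrCreate())

df = spark.range(10_000_000)

# Explicitly repartition before an expensive transformation.
df = df.repartition(8)
print(df.rdd.getNumPartitions())   # 8
```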

Picking the number is a bit of trial and error, though. Take advantage of the Spark web UI to monitor how your tasks are running. Having only a few tasks with many records each means you should probably increase the number of partitions; on the other hand, many partitions with only a few records each is also bad, and you should reduce the partitioning in those cases.
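For example (hypothetical partition counts, reusing the same kind of local-mode session as above), coalesce merges partitions without a full shuffle when you have too many small ones, while repartition redistributes the data with a shuffle when you have too few large ones:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")
         .appName("repartition-demo")       # example name
         .getOrCreate())

df = spark.range(1_000_000)

# Too many tiny partitions: merge them without a full shuffle.
df_fewer = df.coalesce(4)

# Too few, oversized partitions: redistribute with a full shuffle.
df_more = df.repartition(16)

print(df_fewer.rdd.getNumPartitions(), df_more.rdd.getNumPartitions())
```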

answered Nov 15 '22 by puhlen