My use case is as follows.
The issue I am facing is deciding the number of partitions to apply to the input data. The input data size varies from run to run, so hard-coding a particular value is not an option. Spark performs well only when a near-optimal number of partitions is applied to the input data, and finding that number currently takes a lot of iteration (trial and error), which is not acceptable in a production environment.
My question: is there a rule of thumb for deciding the number of partitions based on the input data size and the cluster resources available (executors, cores, etc.)? If yes, please point me in that direction. Any help is much appreciated.
I am using Spark 1.0 on YARN.
Thanks, AG
Two notes from Tuning Spark in the official Spark documentation:
1- In general, we recommend 2-3 tasks per CPU core in your cluster.
2- Spark can efficiently support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has a low task launching cost, so you can safely increase the level of parallelism to more than the number of cores in your clusters.
These are two rules of thumb that help you estimate the number and size of partitions. So it's better to have small tasks (ones that complete in a few hundred milliseconds).
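For example, here is a minimal sketch (run in spark-shell, where sc is predefined) that turns the 2-3 tasks-per-core guideline into a partition count; the executor and core counts are hypothetical values you would read from your own job configuration:

```scala
// Hypothetical cluster settings -- substitute your own
// spark.executor.instances / spark.executor.cores values.
val numExecutors = 10
val coresPerExecutor = 4
val tasksPerCore = 3 // upper end of the 2-3 tasks-per-core guideline

// 10 executors * 4 cores * 3 tasks/core = 120 partitions.
val numPartitions = numExecutors * coresPerExecutor * tasksPerCore

val data = sc.textFile("hdfs:///input/path").repartition(numPartitions)
println(data.partitions.length) // 120
```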
Determining the number of partitions is a bit tricky. Spark will by default try to infer a sensible number of partitions. Note: if you are using the textFile method with compressed text, Spark will disable splitting, and then you will need to re-partition (it sounds like this might be what's happening?). With non-compressed data, when you are loading with sc.textFile you can also specify a minimum number of partitions (e.g. sc.textFile(path, minPartitions)).
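For example (the paths here are hypothetical placeholders):

```scala
// Splittable input: ask Spark for at least 100 partitions up front.
val plain = sc.textFile("hdfs:///data/plain.txt", 100)

// Gzip is not splittable, so the whole file loads as a single partition;
// repartition afterwards to recover parallelism (this incurs a shuffle).
val gz = sc.textFile("hdfs:///data/archive.txt.gz").repartition(100)
```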
The coalesce function can only reduce the number of partitions, so if you need to increase the count, consider the repartition() function instead.
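A minimal contrast of the two, where rdd stands in for any existing RDD:

```scala
// coalesce merges existing partitions without a shuffle -- cheap for
// shrinking the count, but it cannot grow it (unless shuffle = true).
val fewer = rdd.coalesce(10)

// repartition triggers a full shuffle and can increase the count.
val more = rdd.repartition(500)
```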
As far as choosing a "good" number goes, you generally want at least as many partitions as executors, for parallelism. Spark already has some logic to determine a "good" level of parallelism, and you can get this value by calling sc.defaultParallelism.
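For example, you can use Spark's own estimate as a baseline (the 3x multiplier is an assumption following the tasks-per-core guideline above, not a fixed rule):

```scala
// defaultParallelism reflects the total cores Spark knows about.
val baseline = sc.defaultParallelism
val tuned = rdd.repartition(baseline * 3)
println(s"default: $baseline, tuned: ${tuned.partitions.length}")
```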