 

Prioritizing partitions / task execution in Spark

I have a Spark job with skewed data. The data needs to be partitioned based on a column. I would like to tell Spark to start processing the biggest partitions first so that I use the available resources more efficiently.
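For reference, this is roughly how I confirm the skew; the column name `key` and the input path are placeholders for my real job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("skew-check").getOrCreate()
val df = spark.read.parquet("/data/events") // placeholder input

// Rows per key, biggest first: the top entry is the heavy partition.
df.groupBy("key")
  .count()
  .orderBy(desc("count"))
  .show(10)
```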

The reasoning goes as follows: I have 10,000 partitions in total, of which 9,999 take just 1 minute each to process and 1 takes 10 minutes. If I get the heavy partition first, I can finish the job in 11 minutes; if I get it last, it takes 18 minutes.

Is there a way to prioritize partitions? Does this make sense to you?

I sketched the two scenarios in a spreadsheet: [image: spreadsheet comparing the heavy-partition-first and heavy-partition-last scenarios]

asked Aug 16 '18 by cadama


2 Answers

Your reasoning is correct as far as it goes: if the big task were started immediately, your overall job would finish earlier. But it's also true that you cannot control the ordering (/prioritization) of the tasks, since the Spark task scheduler does not provide an interface to define that ordering.
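That said, one workaround I've seen (not a scheduler API, just a sketch assuming the heavy keys are known up front; key names and paths below are illustrative) is to split the heavy keys into their own job and launch it first from a separate thread, relying on FAIR scheduling to interleave the two jobs:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("heavy-first")
  .config("spark.scheduler.mode", "FAIR") // share executors between concurrent jobs
  .getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/data/events") // illustrative input
val heavyKeys = Seq("big_customer")         // found beforehand, e.g. with a count

val heavy = df.filter($"key".isin(heavyKeys: _*))
val light = df.filter(!$"key".isin(heavyKeys: _*))

// Each write is its own Spark job; starting the heavy one on a
// separate thread lets it run alongside the light one instead of last.
val heavyJob = new Thread(() => heavy.write.parquet("/out/heavy"))
heavyJob.start()
light.write.parquet("/out/light")
heavyJob.join()
```

Each action is a separate Spark job, so the heavy write starts immediately instead of queuing behind the 9,999 small tasks.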

answered Oct 23 '22 by WestCoastProjects


Long-running tasks are often a result of skewed data. The right solution here is to repartition your data to ensure an even distribution among tasks.

1. Evenly distribute your data using repartition, as said by @Chandan.
2. You might encounter network issues while dealing with skewed data, where an executor's heartbeat times out. In such cases, consider increasing your spark.network.timeout and spark.executor.heartbeatInterval (a combined sketch of both points follows below).
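A minimal sketch combining both points; the timeout values, input path, column name, and the salting step (one common way to make repartitioning actually spread a hot key across tasks) are illustrative, not prescriptive:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("skew-mitigation")
  .config("spark.network.timeout", "800s")           // default is 120s
  .config("spark.executor.heartbeatInterval", "60s") // must stay well below the timeout
  .getOrCreate()

val df = spark.read.parquet("/data/events") // illustrative input

// Repartition on the skewed column plus a random salt so a single hot
// key is spread over several tasks instead of landing in one.
val salted = df
  .withColumn("salt", (rand() * 16).cast("int"))
  .repartition(col("key"), col("salt"))
```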

An important suggestion: look at the data locality level. The locality level, as far as I know, indicates which type of access to the data was performed. When a node finishes all its work and its CPU becomes idle, Spark may decide to start pending tasks that require obtaining data from other places. So ideally, all your tasks should be process-local, as that is associated with the lowest data access latency.

You can configure the wait time before moving to other locality levels using:

spark.locality.wait
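For example, a minimal sketch (the 10s value is illustrative; the default is 3s, and per-level variants such as spark.locality.wait.process also exist):

```scala
import org.apache.spark.sql.SparkSession

// Raise the wait so Spark holds out longer for PROCESS_LOCAL slots
// before launching tasks at a less local level (default is 3s).
val spark = SparkSession.builder()
  .appName("locality-tuning")
  .config("spark.locality.wait", "10s") // illustrative value
  .getOrCreate()
```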

1. Spark official docs on data locality

2. Explanation of data locality

answered Oct 23 '22 by devesh