 

How does the Apache Spark scheduler split files into tasks?

At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals. On page 17 of his slides, a stage is split into 4 tasks, as shown below:

[slide screenshot: one stage split into 4 tasks]

Here I would like to understand three things about how a stage is split into tasks:

  1. In the example above, it seems that the number of tasks is based on the number of files. Am I right?

  2. If I'm right in point 1, and there were only 3 files under the directory, would it create just 3 tasks?

  3. If I'm right in point 2, what if there is just one, but very large, file? Would this stage have just 1 task? And what about when the data comes from a streaming source?

Thanks a lot; I am confused about how a stage is split into tasks.

asked Jan 30 '15 by taotao.li

1 Answer

You can configure the number of partitions (splits) for the entire process as the second parameter to a job; e.g. for parallelize, if we want 3 partitions:

val a = sc.parallelize(myCollection, 3)

Spark will divide the work into partitions of relatively even size (*). Large files will be broken down accordingly; you can see the actual number of partitions with:

rdd.partitions.size
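
For example, here is a minimal self-contained sketch of both calls together (the collection, app name, and local master are assumptions for illustration, not from the original answer):

import org.apache.spark.{SparkConf, SparkContext}

// Local SparkContext just for the demo
val conf = new SparkConf().setAppName("partition-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// Example in-memory collection
val myCollection = 1 to 100

// Ask for 3 partitions explicitly; the scheduler creates 3 tasks for this stage
val a = sc.parallelize(myCollection, 3)
println(a.partitions.size) // 3

sc.stop()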

So no, you will not end up with a single worker chugging away for a long time on a single file.

(*) If you have very small files, that may change this behavior. But in any case, large files will follow this pattern.
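
As a rough sketch of point 3 from the question (one very large file), something like the following can confirm that a single file is still split into many partitions; the path and the minimum-partitions hint of 8 are hypothetical:

// Read one large file; the second argument is a minimum-partitions hint.
// On HDFS, Spark also creates at least one partition per block of the file,
// so a single big file still yields many tasks.
val lines = sc.textFile("/data/one-big-file.txt", 8)
println(lines.partitions.size) // >= 8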

answered Oct 05 '22 by WestCoastProjects