
Where can I find the cost of the operations in Spark?

Let's say I have two RDDs of sizes M1 and M2, each distributed equally across p partitions.

I'm interested in knowing (theoretically / approximately) what the cost of operations such as filter, map, leftOuterJoin, ++, reduceByKey, etc. is.

Thanks for the help.

asked May 07 '16 by David Herskovics


People also ask

Is count a costly operation in Spark?

count is an action: the counting work is distributed across the partitions, and the per-partition results are brought back to the driver node. When the lineage leading up to it involves a data shuffle, the count operation can become noticeably more expensive.

What is a cost model in Spark?

The cost model covers the class of Generalized Projection, Selection, Join (GPSJ) queries. It takes into account the network and I/O costs as well as the most relevant CPU costs. The execution cost is computed starting from a physical plan produced by Spark.

What are the two types of operations in Apache Spark?

The two types of Apache Spark RDD operations are transformations and actions. A transformation is a function that produces a new RDD from existing RDDs, while an action is performed when we want to work with the actual dataset and return a result.

What are RDD operations in Spark?

RDD Operations. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
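
For example, here is a minimal sketch of that distinction, assuming a SparkContext named sc is already available (as in spark-shell); the filter and map are lazy transformations, and nothing actually executes until the reduce action.

    // Minimal sketch, assuming a SparkContext named sc (e.g. inside spark-shell).
    val nums = sc.parallelize(1 to 1000000, numSlices = 8)   // an RDD split into 8 partitions

    // Transformations are lazy: nothing runs yet, Spark only records the lineage.
    val evens   = nums.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2L)

    // An action triggers the actual computation and returns a value to the driver.
    val total = doubled.reduce(_ + _)
    println(total)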


1 Answer

To measure the cost of execution, it is important to understand how Spark performs that execution.

In a nutshell, when you apply a set of transformations to your RDDs, Spark creates an execution plan (a DAG) and groups the transformations into stages, which are executed once you trigger an action.

Operations like map/filter/flatMap are grouped together to form one stage, since they do not incur a shuffle, while operations like join and reduceByKey create additional stages because they require data to be moved across executors. Spark executes an action as a sequence of stages (run sequentially, or in parallel when they are independent of each other). Each stage is executed as a number of parallel tasks, where the number of tasks running at a time depends on the partitioning of the RDD and the resources available.
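
As a rough illustration of where those stage boundaries fall, rdd.toDebugString prints the lineage with indentation at shuffle boundaries. The sketch below assumes a SparkContext sc and uses made-up toy data.

    // Sketch: inspecting where Spark will place stage boundaries.
    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)), 4)
    val right = sc.parallelize(Seq(("a", "x"), ("c", "y")), 4)

    // filter and map are pipelined into the same stage; the leftOuterJoin
    // forces a shuffle, i.e. a new stage.
    val plan = left
      .filter(_._2 > 0)
      .map { case (k, v) => (k, v * 10) }
      .leftOuterJoin(right)

    // Indentation changes in the lineage string mark shuffle (stage) boundaries.
    println(plan.toDebugString)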

The best way to measure the cost of your operations is to look at the Spark UI. Open the Spark UI (by default it is at localhost:4040 when you run in local mode). You'll find several tabs at the top of the page; clicking one of them takes you to a page that shows the corresponding metrics.
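
If you are not sure of the port (Spark falls back to 4041, 4042, ... when 4040 is already taken), you can ask the context directly; a small sketch, again assuming a running SparkContext sc:

    // Print the address of the UI for this application, if it is enabled.
    sc.uiWebUrl.foreach(url => println(s"Spark UI available at $url"))

    // Trigger an action so a job (with its stages and tasks) shows up in the UI.
    sc.parallelize(1 to 100000, 8).map(_ * 2).count()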

Here is what I do to measure the performance (a programmatic sketch follows the list):

  • Cost of a Job => the sum of the costs of executing all of its stages.
  • Cost of a Stage => the mean of the costs of the parallel tasks executed in that stage.
  • Cost of a Task => by default a task consumes one CPU core; the memory consumed is shown in the UI and depends on the size of your partition.
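
The same kind of numbers can also be collected programmatically with a SparkListener instead of reading them off the UI. This is only a sketch, assuming a SparkContext sc; the buffer and the averaging at the end are illustrative, not a hardened recipe.

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    // Collect the executor run time (in ms) of every finished task.
    val taskTimes = scala.collection.mutable.ArrayBuffer.empty[Long]

    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        // taskMetrics can be null for failed tasks, so guard before reading it.
        Option(taskEnd.taskMetrics).foreach(m => taskTimes += m.executorRunTime)
      }
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
        println(s"Stage ${stage.stageInfo.stageId} finished with ${stage.stageInfo.numTasks} tasks")
      }
    })

    // Run something, then summarise the task-level cost.
    sc.parallelize(1 to 1000000, 8).map(_ * 2L).reduce(_ + _)

    // Note: listener events arrive asynchronously on the listener bus, so very
    // recent tasks may not be in the buffer yet when this line runs.
    if (taskTimes.nonEmpty)
      println(s"Mean task time: ${taskTimes.sum.toDouble / taskTimes.size} ms")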

It is really difficult to derive metrics for each individual transformation within a stage, since Spark pipelines these transformations and executes them together on each partition of the RDD.

answered Oct 03 '22 by shashwat