Let's say I have two RDDs of sizes M1 and M2, each distributed equally across p partitions.
I'm interested in knowing (theoretically / approximately) what the cost is of operations such as filter, map, leftOuterJoin, ++, reduceByKey, etc.
Thanks for the help.
PySpark follows a distributed execution model: transformations run on the executors, and an action is what triggers the computation and brings its result back to the driver node. Operations that involve a data shuffle can make even a simple action such as count noticeably more expensive.
The cost model covers the class of Generalized Projection, Selection, Join (GPSJ) queries. It takes into account network and I/O costs as well as the most relevant CPU costs. The execution cost is computed starting from the physical plan produced by Spark.
Apache Spark RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. A transformation produces a new RDD from existing RDDs, but nothing is computed until we want to work with the actual data, at which point an action is performed.
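For concreteness, here is a minimal Scala sketch (assuming an existing SparkContext named sc, e.g. from spark-shell) of that difference: map and filter only record lineage, while the count action actually launches a distributed job.

```scala
// Assumes an existing SparkContext `sc` (e.g. from spark-shell).
val numbers = sc.parallelize(1 to 1000000, numSlices = 8)   // RDD with 8 partitions

// Transformations: lazy, they only record the lineage, nothing runs yet.
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(x => x.toLong * x)

// Action: this is what actually launches a job on the cluster.
val howMany = squares.count()
println(s"count = $howMany")
```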
To measure the cost of execution it is important to understand how Spark execution works.
In a nutshell, when you apply a set of transformations to your RDDs, Spark builds an execution plan (a DAG) and groups the transformations into stages, which are executed once you trigger an action.
Operations like map, filter and flatMap are grouped into a single stage since they do not incur a shuffle, while operations like join and reduceByKey create additional stages because they require data to be moved across executors. Spark executes an action as a sequence of stages (run sequentially, or in parallel if they are independent of each other), and each stage is executed as a number of parallel tasks, where the number of tasks running at a time depends on the number of RDD partitions and the resources available.
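As an illustration of stage boundaries (again a sketch assuming an existing SparkContext sc; the input and output paths are hypothetical), the narrow transformations below are pipelined into one stage, while reduceByKey introduces a shuffle and therefore a second stage. toDebugString prints the lineage, including the shuffle dependency:

```scala
// Assumes an existing SparkContext `sc`.
val words = sc.textFile("hdfs:///path/to/input")     // hypothetical input path
  .flatMap(_.split("\\s+"))                          // narrow: pipelined into stage 0
  .map(word => (word, 1))                            // narrow: pipelined into stage 0
val counts = words.reduceByKey(_ + _)                // wide: shuffle => new stage

// Inspect the lineage; the ShuffledRDD marks the stage boundary.
println(counts.toDebugString)

// Triggering an action runs the job as two stages of parallel tasks.
counts.saveAsTextFile("hdfs:///path/to/output")      // hypothetical output path
```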
The best way to measure the cost of your operations is to look at the Spark UI. Open the Spark UI (by default it is at localhost:4040 if you are running in local mode). You'll find several tabs at the top of the page; clicking on any of them takes you to a page showing the corresponding metrics.
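If you prefer not to assume port 4040, the UI address can also be read from the SparkContext (a small sketch; uiWebUrl is empty when the UI is disabled):

```scala
// Assumes an existing SparkContext `sc`.
// uiWebUrl is None when the UI is disabled (spark.ui.enabled=false).
sc.uiWebUrl.foreach(url => println(s"Spark UI available at $url"))
```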
Here is what I do to measure the performance:

Job => sum of the costs of executing all of its stages.
Stage => mean of the cost of executing its parallel tasks.
Task => by default, a task consumes one CPU core; the memory consumed is shown in the UI and depends on the size of your partition.

It is really difficult to derive metrics for each individual transformation within a stage, since Spark combines these transformations and executes them together on each partition of the RDD.
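If you want the same numbers programmatically instead of reading them off the UI, one possible approach (a rough sketch using the standard SparkListener callbacks; the aggregation itself is just an illustration and assumes an existing SparkContext sc) is to register a listener that sums task durations per stage:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}
import scala.collection.mutable

// Collects per-stage task durations and prints a summary when each stage completes.
class StageCostListener extends SparkListener {
  private val taskDurations = mutable.Map.empty[Int, mutable.ArrayBuffer[Long]]

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    val buf = taskDurations.getOrElseUpdate(taskEnd.stageId, mutable.ArrayBuffer.empty[Long])
    buf += taskEnd.taskInfo.duration          // wall-clock task duration in ms
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = synchronized {
    val info = stageCompleted.stageInfo
    val durations = taskDurations.getOrElse(info.stageId, mutable.ArrayBuffer.empty[Long])
    val total = durations.sum
    val mean  = if (durations.nonEmpty) total / durations.length else 0L
    println(s"Stage ${info.stageId} (${info.name}): ${durations.length} tasks, " +
            s"total task time = $total ms, mean task time = $mean ms")
  }
}

// Register the listener before triggering any action.
sc.addSparkListener(new StageCostListener)
```

The per-stage totals and means this prints correspond roughly to the Job/Stage/Task breakdown described above.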