Let's say I have two RDDs of sizes M1 and M2, each distributed equally across p partitions.
I'm interested in knowing (theoretically / approximately) what the cost is of operations such as filter, map, leftOuterJoin, ++, reduceByKey, etc.
Thanks for the help.
PySpark follows a distributed execution model: transformations run on the executors, and an action is what triggers the computation and brings its result back to the driver node. Operations that involve a data shuffle can make even a simple action such as count noticeably more expensive.
The cost model covers the class of Generalized Projection, Selection, Join (GPSJ) queries. It takes into account network and I/O costs as well as the most relevant CPU costs. The execution cost is computed starting from the physical plan produced by Spark.
Apache Spark RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. A transformation produces a new RDD from existing RDDs, but nothing is computed until we want to work with the actual data, at which point an action is performed.
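For concreteness, here is a minimal Scala sketch (assuming an existing SparkContext named sc, e.g. from spark-shell) of that difference: map and filter only record lineage, while the count action actually launches a distributed job.

```scala
// Assumes an existing SparkContext `sc` (e.g. from spark-shell).
val numbers = sc.parallelize(1 to 1000000, numSlices = 8)   // RDD with 8 partitions

// Transformations: lazy, they only record the lineage, nothing runs yet.
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(x => x.toLong * x)

// Action: this is what actually launches a job on the cluster.
val howMany = squares.count()
println(s"count = $howMany")
```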
To measure the cost of execution it is important to understand how Spark execution works.
In a nutshell, when you apply a set of transformations to your RDDs, Spark builds an execution plan (a DAG) and groups the transformations into stages, which are executed once you trigger an action.
Operations like map, filter and flatMap are grouped into a single stage since they do not incur a shuffle, while operations like join and reduceByKey create additional stages because they require data to be moved across executors. Spark executes an action as a sequence of stages (run sequentially, or in parallel if they are independent of each other), and each stage is executed as a number of parallel tasks, where the number of tasks running at a time depends on the number of RDD partitions and the resources available.
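As an illustration of stage boundaries (again a sketch assuming an existing SparkContext sc; the input and output paths are hypothetical), the narrow transformations below are pipelined into one stage, while reduceByKey introduces a shuffle and therefore a second stage. toDebugString prints the lineage, including the shuffle dependency:

```scala
// Assumes an existing SparkContext `sc`.
val words = sc.textFile("hdfs:///path/to/input")     // hypothetical input path
  .flatMap(_.split("\\s+"))                          // narrow: pipelined into stage 0
  .map(word => (word, 1))                            // narrow: pipelined into stage 0
val counts = words.reduceByKey(_ + _)                // wide: shuffle => new stage

// Inspect the lineage; the ShuffledRDD marks the stage boundary.
println(counts.toDebugString)

// Triggering an action runs the job as two stages of parallel tasks.
counts.saveAsTextFile("hdfs:///path/to/output")      // hypothetical output path
```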
The best way to measure the cost of your operations is to look at the Spark UI. Open the Spark UI (by default it is at localhost:4040 if you are running in local mode). You'll find several tabs at the top of the page; clicking on any of them takes you to a page showing the corresponding metrics.
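If you prefer not to assume port 4040, the UI address can also be read from the SparkContext (a small sketch; uiWebUrl is empty when the UI is disabled):

```scala
// Assumes an existing SparkContext `sc`.
// uiWebUrl is None when the UI is disabled (spark.ui.enabled=false).
sc.uiWebUrl.foreach(url => println(s"Spark UI available at $url"))
```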
Here is what I do to measure the performance:

Job => sum of the costs of executing all of its stages.
Stage => mean of the cost of executing its parallel tasks.
Task => by default, a task consumes one CPU core; the memory consumed is shown in the UI and depends on the size of your partition.

It is really difficult to derive metrics for each individual transformation within a stage, since Spark combines these transformations and executes them together on each partition of the RDD.
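If you want the same numbers programmatically instead of reading them off the UI, one possible approach (a rough sketch using the standard SparkListener callbacks; the aggregation itself is just an illustration and assumes an existing SparkContext sc) is to register a listener that sums task durations per stage:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}
import scala.collection.mutable

// Collects per-stage task durations and prints a summary when each stage completes.
class StageCostListener extends SparkListener {
  private val taskDurations = mutable.Map.empty[Int, mutable.ArrayBuffer[Long]]

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    val buf = taskDurations.getOrElseUpdate(taskEnd.stageId, mutable.ArrayBuffer.empty[Long])
    buf += taskEnd.taskInfo.duration          // wall-clock task duration in ms
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = synchronized {
    val info = stageCompleted.stageInfo
    val durations = taskDurations.getOrElse(info.stageId, mutable.ArrayBuffer.empty[Long])
    val total = durations.sum
    val mean  = if (durations.nonEmpty) total / durations.length else 0L
    println(s"Stage ${info.stageId} (${info.name}): ${durations.length} tasks, " +
            s"total task time = $total ms, mean task time = $mean ms")
  }
}

// Register the listener before triggering any action.
sc.addSparkListener(new StageCostListener)
```

The per-stage totals and means this prints correspond roughly to the Job/Stage/Task breakdown described above.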