Many skipped stages for Pregel in Spark UI

I am trying to run connected components on a logNormalGraph.

import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

val graph: Graph[Long, Int] = GraphGenerators.logNormalGraph(
  context.spark, numEParts = 10, numVertices = 1000000, mu = 0.01, sigma = 0.01)

val minGraph = graph.connectedComponents()

In the Spark UI, for every subsequent job I can see a constantly growing number of skipped stages:

1 - 4/4 (12 skipped)
2 - 4/4 (23 skipped)
...
50 - 4/4 (4079 skipped)

Why are there so many skipped stages when I run something with Pregel, and why does this number grow so fast (non-linearly)?

asked Apr 12 '16 by Alexander Ponomarev


1 Answer

Step by step. The connectedComponents function is implemented using the Pregel API. Ignoring algorithm-specific details, it iteratively:

  • applies joinVertices, caching the output
  • runs mapReduceTriplets over the result to produce the next round of messages

First, let's create a dummy sendMsg:

import org.apache.spark.graphx._

def sendMsg(edge: EdgeTriplet[VertexId, Int]): 
    Iterator[(VertexId, VertexId)] = {
  Iterator((edge.dstId, edge.srcAttr))
}
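
As an aside, the real connected-components message function is only slightly more involved: it propagates the smaller component id across the edge in whichever direction still makes progress. A rough sketch (not needed for the rest of this walkthrough; the dummy above is enough to reproduce the DAG behaviour):

def sendMsgCC(edge: EdgeTriplet[VertexId, Int]):
    Iterator[(VertexId, VertexId)] = {
  if (edge.srcAttr < edge.dstAttr) Iterator((edge.dstId, edge.srcAttr))       // push smaller id forward
  else if (edge.srcAttr > edge.dstAttr) Iterator((edge.srcId, edge.dstAttr))  // or backward
  else Iterator.empty                                                         // nothing left to do on this edge
}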

vprog:

val vprog = (id: Long, attr: Long, msg: Long) => math.min(attr, msg)

and mergeMsg:

val mergeMsg = (a: Long, b: Long) => math.min(a, b)

Next we can initialize an example graph:

import org.apache.spark.graphx.util.GraphGenerators

val graph = GraphGenerators.logNormalGraph(
   sc, numEParts = 10, numVertices = 100,  mu = 0.01, sigma = 0.01)
  .mapVertices { case (vid, _) => vid }

val g0 = graph
  .mapVertices((vid, vdata) => vprog(vid, vdata, Long.MaxValue))
  .cache()

and messages:

val messages0 = g0.mapReduceTriplets(sendMsg, mergeMsg).cache()

Since GraphXUtils is private, we have to use Graph methods directly.
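
(Side note: on newer Spark versions mapReduceTriplets is deprecated; the same aggregation can be written with aggregateMessages. A sketch with the same dummy semantics as above:)

val messages0Alt = g0.aggregateMessages[VertexId](
  ctx => ctx.sendToDst(ctx.srcAttr),  // same as the dummy sendMsg
  (a, b) => math.min(a, b)            // same as mergeMsg
).cache()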

When you take a look at the DAG generated by

messages0.count

you'll already see some skipped stages:

[DAG visualization for messages0.count, with some stages already skipped]
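
If the UI isn't handy, the lineage can also be inspected textually; toDebugString is a standard RDD method, though its output format varies between Spark versions:

println(messages0.toDebugString)  // parents already materialized (cached / shuffle output) are what the UI later marks as skipped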

After executing the first iteration

val g1 = g0.joinVertices(messages0)(vprog).cache()
val messages1 = g1.mapReduceTriplets(sendMsg, mergeMsg).cache()
messages1.count

the graph will look more or less like this:

[DAG visualization after the first iteration]

If we continue:

val g2 = g1.joinVertices(messages1)(vprog).cache()
val messages2 = g2.mapReduceTriplets(sendMsg, mergeMsg).cache()
messages2.count

we get the following DAG:

[DAG visualization after the second iteration]
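
Repeating this pattern is essentially what Pregel does for us under the hood. A rough sketch of the loop (assuming a fixed iteration count for illustration, whereas the real implementation also tracks remaining messages and stops when there are none):

var g = g0
var messages = messages0
val maxIterations = 5  // hypothetical bound, for illustration only

for (i <- 1 to maxIterations) {
  g = g.joinVertices(messages)(vprog).cache()                // join step
  messages = g.mapReduceTriplets(sendMsg, mergeMsg).cache()  // message aggregation
  messages.count                                             // each materialization adds a job with more skipped stages
}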

So what happened here:

  • we execute an iterative algorithm which takes a dependency on the same data twice, once for the join and once for the message aggregation. This leads to an increasing number of stages on which g depends in each iteration.
  • since data is intensively cached (explicitly, as you can see in the code, and implicitly by persisting shuffle files) and checkpointed (I could be wrong here, but checkpoints are typically marked as green dots), each stage has to be computed only once, even if multiple downstream stages depend on it.
  • after the data is initialized (g0, messages0), only the latest stages are computed from scratch.
  • if you take a closer look at the DAG you'll see that there are quite complex dependencies, which should account for the remaining discrepancy between the relatively slow growth of the DAG and the number of skipped stages.

The first property explains the growing number of stages; the second one explains why stages are skipped.
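
The skipping itself is not GraphX-specific. A minimal sketch with plain RDDs shows the same behaviour (exact stage counts depend on the Spark version): the shuffle for reduceByKey is computed by the first job and reused, i.e. skipped, by the second.

val counts = sc.parallelize(1 to 1000)
  .map(i => (i % 10, 1))
  .reduceByKey(_ + _)
  .cache()

counts.count  // job 1: shuffle stage computed
counts.count  // job 2: upstream stage shown as skipped in the UI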

answered Sep 20 '22 by zero323