 

Spark example program runs very slowly

I tried to use Spark to work on a simple graph problem. I found an example program in the Spark source folder: transitive_closure.py, which computes the transitive closure of a graph with no more than 200 edges and vertices. But on my own laptop, it runs for more than 10 minutes and doesn't terminate. The command line I use is: spark-submit transitive_closure.py.

I wonder why Spark is so slow even when computing such a small transitive closure result. Is this a common case? Is there any configuration I'm missing?

The program is shown below, and it can be found in the Spark install folder on their website.

from __future__ import print_function

import sys
from random import Random

from pyspark import SparkContext

numEdges = 200
numVertices = 100
rand = Random(42)


def generateGraph():
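    # Generate numEdges distinct directed edges between vertices drawn
    # from the range [0, numVertices).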
    edges = set()
    while len(edges) < numEdges:
        src = rand.randrange(0, numVertices)
        dst = rand.randrange(0, numVertices)
        if src != dst:
            edges.add((src, dst))
    return edges


if __name__ == "__main__":
    """
    Usage: transitive_closure [partitions]
    """
    sc = SparkContext(appName="PythonTransitiveClosure")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    tc = sc.parallelize(generateGraph(), partitions).cache()

    # Linear transitive closure: each round grows paths by one edge,
    # by joining the graph's edges with the already-discovered paths.
    # e.g. join the path (y, z) from the TC with the edge (x, y) from
    # the graph to obtain the path (x, z).

    # Because join() joins on keys, the edges are stored in reversed order.
    edges = tc.map(lambda x_y: (x_y[1], x_y[0]))

    oldCount = 0
    nextCount = tc.count()
    while True:
        oldCount = nextCount
        # Perform the join, obtaining an RDD of (y, (z, x)) pairs,
        # then project the result to obtain the new (x, z) paths.
        new_edges = tc.join(edges).map(lambda __a_b: (__a_b[1][1], __a_b[1][0]))
        tc = tc.union(new_edges).distinct().cache()
        nextCount = tc.count()
        if nextCount == oldCount:
            break

    print("TC has %i edges" % tc.count())

    sc.stop()
asked Feb 22 '16 by c21



1 Answer

There can be many reasons why this code doesn't perform particularly well on your machine, but most likely this is just another variant of the problem described in Spark iteration time increasing exponentially when using join. The simplest way to check whether that is indeed the case is to pass the spark.default.parallelism parameter on submit:

bin/spark-submit --conf spark.default.parallelism=2 \
  examples/src/main/python/transitive_closure.py
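
The same setting can also be applied programmatically before the SparkContext is created. This is only a sketch of the equivalent configuration, assuming you edit the example script itself; the application name mirrors the one used in the example:

from pyspark import SparkConf, SparkContext

# Fix the default number of partitions used by shuffle operations
# such as join() and distinct(), instead of passing --conf on submit.
conf = SparkConf() \
    .setAppName("PythonTransitiveClosure") \
    .set("spark.default.parallelism", "2")

sc = SparkContext(conf=conf)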

If not limited otherwise, SparkContext.union, RDD.join and RDD.union set the number of partitions of the child to the total number of partitions of the parents. Usually this is the desired behavior, but it can become extremely inefficient when applied iteratively.
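
Another way to keep that growth in check, shown here only as an illustrative sketch rather than as part of the original answer, is to pass an explicit numPartitions to the wide operations inside the loop, so the partition count stays constant across iterations:

# Sketch: hold the partition count constant across iterations by
# passing numPartitions explicitly to join() and distinct().
while True:
    oldCount = nextCount
    new_edges = tc.join(edges, numPartitions=partitions) \
                  .map(lambda y_zx: (y_zx[1][1], y_zx[1][0]))
    tc = tc.union(new_edges).distinct(numPartitions=partitions).cache()
    nextCount = tc.count()
    if nextCount == oldCount:
        break

Here partitions is the same value the example script already reads from the command line.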

answered by zero323