Partitioning with Spark Graphframes

I'm working with a largish (?) graph (60 million vertices and 9.5 billion edges) using Spark GraphFrames. The underlying data is not large: the vertices take about 500 MB on disk and the edges about 40 GB. My containers frequently shut down with Java heap out-of-memory errors, but I think the underlying problem is that the GraphFrame is constantly shuffling data around (I'm seeing shuffle read/write of up to 150 GB). Is there a way to efficiently partition a GraphFrame, or the underlying edges/vertices, to reduce shuffle?

asked Dec 27 '16 20:12

John

2 Answers

TL;DR It is not possible to efficiently partition a GraphFrame.

GraphFrames algorithms can be separated into two categories:

- Methods which delegate processing to their GraphX counterparts. GraphX supports a number of partitioning methods, but these are not exposed via the GraphFrames API. If you use one of these, it is probably better to use GraphX directly.

  Unfortunately, development of GraphX has stopped almost completely, with only a handful of small fixes over the last two years, and overall performance is highly disappointing compared to both in-core and out-of-core libraries.

- Methods which are implemented natively using Spark Datasets, which, given their limited programming model and single partitioning mode, are deeply unfit for complex graph processing.

  While relational columnar storage can be used for efficient graph processing, the naive iterative join approach employed by GraphFrames just doesn't scale (but it is OK for shallow traversals of one or two hops).
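To make the "iterative join" point concrete, here is a toy model in plain Python. It is not GraphFrames code; `hop` and `traverse` are hypothetical names invented for this sketch. Each hop behaves like one join of the current frontier against the whole edge table, so the work grows with hops times edges. In Spark, each such join can also trigger a shuffle.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[int, int]

def hop(frontier: Set[int], edges: Iterable[Edge]) -> Tuple[Set[int], int]:
    """One 'join' of the frontier against the full edge table.

    Returns the next frontier and the number of edge rows examined.
    """
    scanned = 0
    nxt: Set[int] = set()
    for src, dst in edges:
        scanned += 1
        if src in frontier:
            nxt.add(dst)
    return nxt, scanned

def traverse(start: int, edges: List[Edge], hops: int) -> Tuple[Set[int], int]:
    """Repeat the join once per hop and total the rows examined."""
    frontier, total = {start}, 0
    for _ in range(hops):
        frontier, scanned = hop(frontier, edges)
        total += scanned
    return frontier, total

# A path graph 0 -> 1 -> 2 -> 3 -> 4
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
reached, rows_scanned = traverse(0, edges, hops=3)
print(reached, rows_scanned)  # prints: {3} 12
```

Three hops over four edges touch twelve rows even though only three edges matter. With billions of edges, that per-hop full pass is what stops scaling.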

You can try to repartition vertices and edges DataFrames by id and src respectively:
```
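import org.graphframes.GraphFrame

val nPart: Int = ???  // target number of partitions

// Hash vertices by id and edges by src so each vertex shares a partition with its outgoing edges
GraphFrame(v.repartition(nPart, v("id")), e.repartition(nPart, e("src")))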
```
which should help in some cases.
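A rough way to see why this only helps in some cases, sketched in plain Python rather than Spark. The `partition_of` helper is a hypothetical stand-in for Spark's hash partitioning (plain modulo here, to keep it deterministic), and the tiny graph is made up. Hashing vertices by `id` and edges by `src` with the same partition count puts every edge next to its source vertex, but not next to its destination vertex.

```python
def partition_of(key: int, n_part: int) -> int:
    # Stand-in for a hash partitioner: same key and same n_part give the same partition.
    return key % n_part

n_part = 4
vertices = list(range(8))
edges = [(0, 1), (0, 5), (1, 2), (2, 6), (3, 7), (4, 0), (5, 1), (6, 3)]

# Vertices are placed by id, edges are placed by src.
vertex_part = {v: partition_of(v, n_part) for v in vertices}
edge_part = {e: partition_of(e[0], n_part) for e in edges}

# Count edges that already sit in the same partition as their src / dst vertex.
src_local = sum(edge_part[e] == vertex_part[e[0]] for e in edges)
dst_local = sum(edge_part[e] == vertex_part[e[1]] for e in edges)
print(src_local, dst_local)  # prints: 8 4
```

All eight edges are local to their source vertex, but only four are local to their destination. Anything that joins on `dst` still has to move data. Whether Spark actually skips the shuffle on the `src` side depends on the query planner.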

Overall, in its current (Dec 2016) state, Spark is not a good choice for intensive graph analytics.

answered Sep 20 '22 06:09

user7347764

Here's a partial solution / workaround: create a UDF that mimics one of the GraphX partition functions, use it to add a new column, and partition on that column.

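from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

num_parts = 256
# Mimics GraphX's RandomVertexCut; assumes integer ids (Python str hashes vary per process)
random_vertex_cut = udf(lambda src, dst: abs(hash((src, dst))) % num_parts, IntegerType())

edge.withColumn("v_cut", random_vertex_cut(col("src"), col("dst"))).repartition(num_parts, "v_cut")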

This approach can help some, but not as well as GraphX.
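For intuition about what this kind of cut does, here is a small plain-Python check with no Spark involved. The function is a Python analogue of the UDF expression above, and the dense 100-vertex graph is made up for illustration. It assumes integer vertex ids.

```python
from collections import Counter

def random_vertex_cut(src: int, dst: int, num_parts: int) -> int:
    # Hash the (src, dst) pair and fold it into the range [0, num_parts).
    return abs(hash((src, dst))) % num_parts

num_parts = 8
# Every ordered pair of distinct vertices among 100 vertices: 9900 edges.
edges = [(s, d) for s in range(100) for d in range(100) if s != d]

# How many edges land in each partition.
sizes = Counter(random_vertex_cut(s, d, num_parts) for s, d in edges)

# Which partitions hold at least one outgoing edge of vertex 0.
parts_for_vertex_0 = {random_vertex_cut(0, d, num_parts) for d in range(1, 100)}

print(sorted(sizes.items()))
print(len(parts_for_vertex_0))
```

The edges should come out roughly balanced across partitions, while the outgoing edges of a single vertex typically end up spread over many of them. That is the trade-off of a vertex cut: even edge counts, but each vertex is effectively replicated, so a join on `src` or `dst` alone still needs data from several partitions.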

answered Sep 20 '22 06:09

John

Tags:

apache-spark

graphframes

John

People also ask

2 Answers

user7347764

John

Recent Activity

Donate For Us