Storing a Graph in Spark GraphX with HDFS

Question

I have constructed a graph in Spark's GraphX. This graph is going to have potentially 1 billion nodes and upwards of 10 billion edges, so I don't want to have to build this graph over and over again.

I want to have the ability to build it once, save it (I think the best is in HDFS), run some processes on it, and then access it in a couple of days or weeks, add some new nodes and edges, and run some more processes on it.

How can I do that in Apache Spark's GraphX?

EDIT: I think I have found a potential solution, but I would like someone to confirm if this is the best way.

If I have a graph, say graph, I can store it by saving its vertex RDD and its edge RDD separately as text files, and then read those files back later. Saving would look like this (the two paths must differ, because saveAsTextFile fails if the target directory already exists):

graph.vertices.saveAsTextFile(someVertexPath)
graph.edges.saveAsTextFile(someEdgePath)

One question I have now is: should I use saveAsTextFile() or saveAsObjectFile()? And how should I read those files back at a later time?

Gaurav Kumar · Accepted Answer

GraphX does not yet have a graph saving mechanism. Consequently, the next best thing to do is to save both the edges and the vertices and construct the graph from them. If your vertices are complex in nature, you should save them as object files (saveAsObjectFile writes a SequenceFile of serialized objects) rather than as text.

graph.vertices.saveAsObjectFile("/location/of/vertices")
graph.edges.saveAsObjectFile("/location/of/edges")

And later on, you can read from disk and construct the graph.

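import org.apache.spark.graphx.{Edge, Graph, VertexId}

// VD and ED stand for your own vertex and edge attribute types.
val vertices = sc.objectFile[(VertexId, VD)]("/location/of/vertices")
val edges = sc.objectFile[Edge[ED]]("/location/of/edges")
val graph = Graph(vertices, edges)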
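Since the question also asks about coming back later to add nodes and edges, here is a minimal sketch of the full round trip under some assumptions: vertex attributes are String, edge attributes are Int, and the names GraphStore, save, load and extend are hypothetical helpers, not part of GraphX. It only uses saveAsObjectFile, sc.objectFile, RDD.union and Graph.apply.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical helper object; the attribute types are assumptions for illustration.
object GraphStore {
  type VD = String // assumed vertex attribute type
  type ED = Int    // assumed edge attribute type

  // Saves vertices and edges under basePath. Neither target directory may exist yet.
  def save(graph: Graph[VD, ED], basePath: String): Unit = {
    graph.vertices.saveAsObjectFile(s"$basePath/vertices")
    graph.edges.saveAsObjectFile(s"$basePath/edges")
  }

  // Reads both RDDs back and rebuilds the graph.
  def load(sc: SparkContext, basePath: String): Graph[VD, ED] = {
    val vertices = sc.objectFile[(VertexId, VD)](s"$basePath/vertices")
    val edges = sc.objectFile[Edge[ED]](s"$basePath/edges")
    Graph(vertices, edges)
  }

  // Builds a new graph from the old vertices and edges plus the additions.
  // If a vertex id appears more than once, Graph.apply keeps one of the
  // attributes arbitrarily.
  def extend(
      graph: Graph[VD, ED],
      newVertices: RDD[(VertexId, VD)],
      newEdges: RDD[Edge[ED]]): Graph[VD, ED] =
    Graph(graph.vertices.union(newVertices), graph.edges.union(newEdges))
}
```

Because the save calls refuse to overwrite an existing directory, the extended graph would have to go to a fresh base path (for example one per build date) rather than back to the one it was loaded from.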

BradRees · Answer

As you mentioned, you will have to save the edge data and possibly the vertex data. The question is whether or not you are using custom vertex or edge classes. If there are no attributes on the edges or vertices, then you can just save the edge file and recreate the graph from that. A simple example using GraphLoader would be:

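// GraphLoader.edgeListFile expects one whitespace-separated "srcId dstId" pair per line.
graph.edges.map(e => s"${e.srcId} ${e.dstId}").saveAsTextFile(path)
...
val myGraph = GraphLoader.edgeListFile(sc, path)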

The only problem is that GraphLoader.edgeListFile returns a Graph[Int, Int], which can be an issue for large graphs. Once you are into the billions, you would do something like:

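// Write plain comma-separated lines so they can be parsed back; use separate paths.
graph.edges.map(e => s"${e.srcId},${e.dstId}").saveAsTextFile(edgePath)
graph.vertices.map(_._1.toString).saveAsTextFile(vertexPath)
....
import org.apache.spark.graphx.{Edge, Graph}

def convertToEdges(line: String): Edge[Long] = {
  val txt = line.split(",")
  Edge(txt(0).toLong, txt(1).toLong, 1L)
}

val edges = sc.textFile(edgePath).map(convertToEdges)
// Graph needs (id, attribute) pairs; 1L is a placeholder vertex attribute.
val verts = sc.textFile(vertexPath).map(f => (f.toLong, 1L))
val myGraph = Graph(verts, edges, 1L)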

I typically use saveAsTextFile simply because I tend to use multiple programs to process the same data file, but it really depends on your file system.
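If the vertices or edges do carry attributes and you still want text files that other programs can read, one option is to define the line format yourself instead of relying on the default toString output. The sketch below assumes a String name on each vertex and an Int weight on each edge; GraphText and its four functions are hypothetical names, and plain Long is used because VertexId in GraphX is an alias for Long.

```scala
// Hypothetical line format helpers; attribute types are assumptions.
object GraphText {
  // One vertex per line: "id,name".
  def formatVertex(id: Long, name: String): String = s"$id,$name"

  def parseVertex(line: String): (Long, String) = {
    // A limit of 2 keeps any commas inside the name intact.
    val Array(id, name) = line.split(",", 2)
    (id.toLong, name)
  }

  // One edge per line: "srcId,dstId,weight".
  def formatEdge(src: Long, dst: Long, weight: Int): String = s"$src,$dst,$weight"

  def parseEdge(line: String): (Long, Long, Int) = {
    val Array(src, dst, weight) = line.split(",")
    (src.toLong, dst.toLong, weight.toInt)
  }
}
```

With a Graph[String, Int], you would map graph.vertices and graph.edges through the format functions before saveAsTextFile, and map the lines from sc.textFile through the parse functions (wrapping the edge triple in Edge) before calling Graph. Names containing line breaks would need escaping, which this sketch does not handle.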

Tags:

apache-spark

spark-graphx

edenmark