I'm trying to create a Graph using some Google Web Graph data which can be found here:
https://snap.stanford.edu/data/web-Google.html
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val textFile = sc.textFile("hdfs://n018-data.hursley.ibm.com/user/romeo/web-Google.txt")
val arrayForm = textFile.filter(_.charAt(0)!='#').map(_.split("\\s+")).cache()
val nodes = arrayForm.flatMap(array => array).distinct().map(_.toLong)
val edges = arrayForm.map(line => Edge(line(0).toLong,line(1).toLong))
val graph = Graph(nodes,edges)
Unfortunately, I get this error:
<console>:27: error: type mismatch;
found : org.apache.spark.rdd.RDD[Long]
required: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, ?)]
Error occurred in an application involving default arguments.
val graph = Graph(nodes,edges)
So how can I create a VertexId object? As far as I understand, passing a Long should be sufficient.
Any ideas?
Thanks a lot!
romeo
GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system. You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms using the Pregel API.
The Pregel operator terminates iteration and returns the final graph when there are no messages remaining. Note that, unlike more standard Pregel implementations, vertices in GraphX can only send messages to neighboring vertices, and message construction is done in parallel using a user-defined messaging function.
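To make the Pregel description concrete, here is a minimal sketch of single-source shortest paths. It assumes a spark-shell session (so `sc` is already in scope); the three weighted edges, the source vertex, and names such as `weightedEdges` and `sssp` are made up for illustration.

```scala
import org.apache.spark.graphx._

// Assumption: running in spark-shell, so `sc` exists. Sample data is invented.
val weightedEdges = sc.parallelize(Seq(
  Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0)))
val weighted = Graph.fromEdges(weightedEdges, 0.0)

val sourceId: VertexId = 1L
// Distance 0 at the source, infinity everywhere else.
val initial = weighted.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initial.pregel(Double.PositiveInfinity)(
  // Vertex program: keep the smaller of the current and the received distance.
  (id, dist, newDist) => math.min(dist, newDist),
  // Messaging function: only neighbors along an edge can be reached.
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  // Merge messages that arrive at the same vertex.
  (a, b) => math.min(a, b))

val distances = sssp.vertices.collect().toMap
```

Iteration stops on its own once no vertex can improve a neighbor's distance, because the messaging function then returns an empty iterator for every triplet.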
We'll write Spark GraphX code using the Scala programming language.
You can also use triplets to join together the vertices and edges based on VertexId. Although Graph natively stores its data as separate edge and vertex RDDs, triplets is a convenience member (accessed without parentheses in Scala) that joins them together for you.
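As a small illustration of that join, the sketch below builds a two-vertex graph and reads each edge together with its endpoint attributes. It assumes spark-shell (`sc` in scope); the vertex labels, the edge label, and the names `labelled`, `link` and `described` are invented for the example.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumption: running in spark-shell, so `sc` exists. Sample data is invented.
val labelled: RDD[(VertexId, String)] = sc.parallelize(Seq((1L, "a"), (2L, "b")))
val link: RDD[Edge[String]] = sc.parallelize(Seq(Edge(1L, 2L, "links-to")))
val smallGraph = Graph(labelled, link)

// Each EdgeTriplet carries the edge attribute plus both vertex attributes.
val described = smallGraph.triplets
  .map(t => s"${t.srcAttr} -${t.attr}-> ${t.dstAttr}")
  .collect()
  .toSeq
```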
Not exactly. If you take a look at the signature of the apply method of the Graph object, you'll see something like this (for the full signature, see the API docs):
apply[VD, ED](
vertices: RDD[(VertexId, VD)], edges: RDD[Edge[ED]], defaultVertexAttr: VD)
As the description says:
Construct a graph from a collection of vertices and edges with attributes.
Because of that, you cannot simply pass RDD[Long] as the vertices argument (RDD[Edge[Nothing]] as edges won't work either).
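// Vertices must be (VertexId, attribute) pairs; here the attribute is an empty Option[String].
val nodes: RDD[(VertexId, Option[String])] = arrayForm.
  flatMap(array => array).
  map(id => (id.toLong, Option.empty[String]))
// Every edge needs an attribute as well; an empty string serves as a placeholder.
val edges: RDD[Edge[String]] = arrayForm.
  map(line => Edge(line(0).toLong, line(1).toLong, ""))
val graph = Graph(nodes, edges)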
Note that:
Duplicate vertices are picked arbitrarily
so calling .distinct() on nodes is unnecessary in this case.
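A quick way to check the deduplication behaviour is to pass a vertex RDD that repeats an id. This is only a sketch, assuming spark-shell (`sc` in scope); the data and the names `dupNodes`, `oneEdge` and `dedupGraph` are made up.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumption: running in spark-shell, so `sc` exists. Sample data is invented.
// Vertex id 1 appears twice with different attributes.
val dupNodes: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "first"), (1L, "second"), (2L, "other")))
val oneEdge: RDD[Edge[String]] = sc.parallelize(Seq(Edge(1L, 2L, "")))

val dedupGraph = Graph(dupNodes, oneEdge)
// Only one entry per id survives; which attribute is kept for id 1 is arbitrary.
val vertexCount = dedupGraph.vertices.count()
```

If the attribute that survives matters, deduplicate or merge the attributes yourself before building the graph.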
If you want to create a Graph without attributes, you can use Graph.fromEdgeTuples.
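A minimal sketch of that approach, assuming spark-shell (`sc` in scope). The three edges stand in for the parsed file, and `rawEdges` and `tupleGraph` are illustrative names. Graph.fromEdgeTuples still needs a default vertex value, and it gives every edge an Int attribute.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumption: running in spark-shell, so `sc` exists. Sample data is invented.
// With the question's data this could instead be
//   arrayForm.map(line => (line(0).toLong, line(1).toLong))
val rawEdges: RDD[(VertexId, VertexId)] =
  sc.parallelize(Seq((0L, 1L), (0L, 2L), (1L, 2L)))

// Vertices are derived from the edge endpoints and all receive the default value.
val tupleGraph: Graph[Int, Int] = Graph.fromEdgeTuples(rawEdges, defaultValue = 1)

val numVertices = tupleGraph.vertices.count()
val numEdges = tupleGraph.edges.count()
```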