Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create a VertexId in Apache Spark GraphX using a Long data type?

I'm trying to create a Graph using some Google Web Graph data which can be found here:

https://snap.stanford.edu/data/web-Google.html

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD



val textFile = sc.textFile("hdfs://n018-data.hursley.ibm.com/user/romeo/web-Google.txt")
val arrayForm = textFile.filter(_.charAt(0)!='#').map(_.split("\\s+")).cache()
val nodes = arrayForm.flatMap(array => array).distinct().map(_.toLong)
val edges = arrayForm.map(line => Edge(line(0).toLong,line(1).toLong))

val graph = Graph(nodes,edges)

Unfortunately, I get this error:

<console>:27: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[Long]
 required: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, ?)]
Error occurred in an application involving default arguments.
       val graph = Graph(nodes,edges)

So how can I create a VertexId object? For my understanding it should be sufficient to pass a Long.

Any ideas?

Thanks a lot!

romeo

like image 503
Romeo Kienzler Avatar asked Jul 02 '15 15:07

Romeo Kienzler


People also ask

What is GraphX in big data?

GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system. You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms using the Pregel API. graph = Graph(vertices, edges)

How the command Pregel works in GraphX?

The Pregel operator terminates iteration and returns the final graph when there are no messages remaining. Note, unlike more standard Pregel implementations, vertices in GraphX can only send messages to neighboring vertices and the message construction is done in parallel using a user defined messaging function.

Which programming languages can be used for using GraphX?

We'll write Spark GraphX code using Scala programming language.

What is a triplet in GraphX?

concept triplet in category graphx You can also use the triplets() method to join together the vertices and edges based on VertexId . Although Graph natively stores its data as separate edge and vertex RDDs, triplets() is a convenience function that joins them together for you, as shown in the following listing.


1 Answers

Not exactly. If you take a look at the signature of the apply method of the Graph object you'll see something like this (for a full signature see API docs):

apply[VD, ED](
    vertices: RDD[(VertexId, VD)], edges: RDD[Edge[ED]], defaultVertexAttr: VD)

As you can read in a description:

Construct a graph from a collection of vertices and edges with attributes.

Because of that you cannot simply pass RDD[Long] as a vertices argument ( RDD[Edge[Nothing]] as edges won't work either).

import scala.{Option, None}

val nodes: RDD[(VertexId, Option[String])] = arrayForm.
    flatMap(array => array).
    map((_.toLong, None))

val edges: RDD[Edge[String]] = arrayForm.
    map(line => Edge(line(0).toLong, line(1).toLong, ""))

Note that:

Duplicate vertices are picked arbitrarily

so .distinct() on nodes is obsolete in this case.

If you want to create a Graph without attributes you can use Graph.fromEdgeTuples.

like image 70
zero323 Avatar answered Sep 19 '22 02:09

zero323