How to create a VertexId in Apache Spark GraphX using a Long data type?

Tags:

scala

apache-spark

spark-graphx

I'm trying to create a Graph using some Google Web Graph data which can be found here:

https://snap.stanford.edu/data/web-Google.html

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD



val textFile = sc.textFile("hdfs://n018-data.hursley.ibm.com/user/romeo/web-Google.txt")
val arrayForm = textFile.filter(_.charAt(0)!='#').map(_.split("\\s+")).cache()
val nodes = arrayForm.flatMap(array => array).distinct().map(_.toLong)
val edges = arrayForm.map(line => Edge(line(0).toLong,line(1).toLong))

val graph = Graph(nodes,edges)

Unfortunately, I get this error:

<console>:27: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[Long]
 required: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, ?)]
Error occurred in an application involving default arguments.
       val graph = Graph(nodes,edges)

So how can I create a VertexId object? As I understand it, passing a Long should be sufficient.

Any ideas?

Thanks a lot!

romeo

asked Jul 02 '15 15:07

Romeo Kienzler

1 Answer

Not exactly. If you take a look at the signature of the apply method of the Graph object you'll see something like this (for a full signature see API docs):

apply[VD, ED](
    vertices: RDD[(VertexId, VD)], edges: RDD[Edge[ED]], defaultVertexAttr: VD)

As its description in the API docs says:

Construct a graph from a collection of vertices and edges with attributes.

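Because of that, you cannot simply pass RDD[Long] as the vertices argument (RDD[Edge[Nothing]] as edges won't work either). VertexId itself is just a type alias for Long, so the ids are fine; what is missing is the attribute. Each vertex has to be an (id, attribute) pair, and each edge needs an attribute as well: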

import scala.{Option, None}

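// An explicit lambda is needed here: in (_.toLong, None) the placeholder
// would expand inside the tuple element and the call would not compile.
val nodes: RDD[(VertexId, Option[String])] = arrayForm.
    flatMap(array => array).
    map(id => (id.toLong, Option.empty[String]))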

val edges: RDD[Edge[String]] = arrayForm.
    map(line => Edge(line(0).toLong, line(1).toLong, ""))

Note that:

Duplicate vertices are picked arbitrarily

so calling .distinct() on nodes is redundant in this case.
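
To see the whole construction end to end, here is a minimal sketch. It assumes it is pasted into spark-shell, where sc is already defined. The four edge lines are hypothetical stand-in data for web-Google.txt, and Option.empty[String] is just one possible placeholder attribute. The vertex ids repeat in the flattened input, and no .distinct() is applied before building the graph.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumption: `sc` is the SparkContext provided by spark-shell.
// Hypothetical toy edge list standing in for the parsed web-Google.txt lines.
val arrayForm: RDD[Array[String]] =
  sc.parallelize(Seq("0 1", "0 2", "1 2", "2 3")).map(_.split("\\s+"))

// Every vertex is an (id, attribute) pair; ids repeat here on purpose.
val nodes: RDD[(VertexId, Option[String])] =
  arrayForm.flatMap(array => array).map(id => (id.toLong, Option.empty[String]))

// Every edge carries an attribute too; an empty String is used as a placeholder.
val edges: RDD[Edge[String]] =
  arrayForm.map(line => Edge(line(0).toLong, line(1).toLong, ""))

val graph: Graph[Option[String], String] = Graph(nodes, edges)

// Inspect the result; repeated vertex ids are merged into one vertex each.
println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")
```

If the vertex attribute carries real data, keep in mind that which duplicate wins is arbitrary, so deduplicate or merge the attributes yourself before calling Graph.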

If you want to create a Graph without attributes you can use Graph.fromEdgeTuples.
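
A minimal sketch of that alternative, again assuming spark-shell's sc and the same hypothetical toy data. Graph.fromEdgeTuples takes plain (source, destination) pairs plus a default vertex attribute and returns a Graph with Int edge attributes; the default value 1 used here is an arbitrary choice for illustration.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Assumption: `sc` is the SparkContext provided by spark-shell.
// Hypothetical toy edge list standing in for the parsed web-Google.txt lines.
val edgeLines: RDD[Array[String]] =
  sc.parallelize(Seq("0 1", "0 2", "1 2", "2 3")).map(_.split("\\s+"))

// Only (srcId, dstId) pairs are needed; no vertex RDD has to be built.
val rawEdges: RDD[(VertexId, VertexId)] =
  edgeLines.map(line => (line(0).toLong, line(1).toLong))

// Vertices are derived from the edges and all get the default attribute.
val edgeOnlyGraph: Graph[Int, Int] =
  Graph.fromEdgeTuples(rawEdges, defaultValue = 1)

println(s"vertices: ${edgeOnlyGraph.numVertices}, edges: ${edgeOnlyGraph.numEdges}")
```

This avoids the separate nodes RDD entirely, which is usually enough when the input is a bare edge list like the one in the question.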

answered Sep 19 '22 02:09

zero323
