
How do I run graphx with Python / pyspark?

I am attempting to run Spark GraphX with Python using pyspark. My installation appears correct, as I am able to run the pyspark tutorials and the (Java) GraphX tutorials just fine. Presumably, since GraphX is part of Spark, pyspark should be able to interface with it, correct?

Here are the tutorials for pyspark:
http://spark.apache.org/docs/0.9.0/quick-start.html
http://spark.apache.org/docs/0.9.0/python-programming-guide.html

Here are the ones for GraphX:
http://spark.apache.org/docs/0.9.0/graphx-programming-guide.html
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html

Can anyone convert the GraphX tutorial to be in Python?

Glenn Strycker asked Apr 25 '14 20:04


People also ask

Does PySpark support GraphX?

No. GraphX computation is only supported using the Scala and RDD APIs.

How do I run PySpark in Python?

From the command line, go to the Spark installation directory and run bin/pyspark. This launches the pyspark shell, which gives you a prompt for interacting with Spark in Python. If Spark's bin directory is on your PATH, you can simply run pyspark from the command line or terminal.

Is GraphX part of Spark?

GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
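
To make the "directed multigraph with properties" idea concrete, here is a minimal plain-Python sketch. It is not the GraphX API: the PropertyGraph class and its method names are hypothetical, made up purely for illustration. It only shows that vertices and edges both carry properties and that two edges may connect the same pair of vertices.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Toy directed multigraph (hypothetical class, not part of Spark)."""
    vertices: dict = field(default_factory=dict)  # vertex id -> property dict
    edges: list = field(default_factory=list)     # (src, dst, property dict)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, dst, **props):
        # Storing edges in a list lets parallel edges coexist,
        # which is what makes this a multigraph.
        self.edges.append((src, dst, props))

    def in_degrees(self):
        # Count incoming edges per destination vertex.
        return Counter(dst for _, dst, _ in self.edges)


g = PropertyGraph()
g.add_vertex("a", name="Alice", age=34)
g.add_vertex("b", name="Bob", age=36)
g.add_edge("a", "b", relationship="friend")
g.add_edge("a", "b", relationship="follow")  # second edge from a to b

print(g.in_degrees()["b"])  # 2
```

GraphX itself distributes the vertex and edge collections as RDDs; the sketch keeps everything in local Python containers and ignores that aspect entirely.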


3 Answers

It looks like the Python bindings for GraphX, first expected in Spark 1.4 and then in 1.5, are now delayed indefinitely. They are waiting behind the Java API.

You can track the status at SPARK-3789, "[GRAPHX] Python bindings for GraphX", on the ASF JIRA.

Misty Nodine answered Oct 19 '22 00:10


You should look at GraphFrames (https://github.com/graphframes/graphframes), which wraps GraphX algorithms under the DataFrames API and provides a Python interface.

Here is a quick example from https://graphframes.github.io/graphframes/docs/_site/quick-start.html, slightly modified so that it works.

First, start pyspark with the graphframes package loaded:

pyspark --packages graphframes:graphframes:0.1.0-spark1.6

Python code:

# sqlContext is predefined by the pyspark shell started above
from graphframes import GraphFrame

# Create a Vertex DataFrame with unique ID column "id"
v = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])

# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# Create a GraphFrame
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()
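
If you want to sanity-check what these queries compute without a Spark installation, the same three-vertex graph can be worked through in plain Python. This is only a sketch: the pagerank function below is a hypothetical helper implementing the textbook normalized PageRank (ranks sum to 1), which is an assumption on my part and may not match the exact variant or scaling GraphFrames uses, so do not expect the numbers in the "pagerank" column to agree digit for digit.

```python
from collections import Counter

# Same data as the GraphFrame example: (src, dst, relationship)
vertices = ["a", "b", "c"]
edges = [("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow")]

# In-degree of each vertex; "a" has no incoming edges, so it is absent.
in_degrees = Counter(dst for _, dst, _ in edges)

# Number of "follow" connections.
follow_count = sum(1 for _, _, rel in edges if rel == "follow")


def pagerank(vertices, edges, reset_probability=0.01, max_iter=20):
    """Textbook PageRank by power iteration (hypothetical helper)."""
    n = len(vertices)
    out_degree = Counter(src for src, _, _ in edges)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        incoming = {v: 0.0 for v in vertices}
        for src, dst, _ in edges:
            # Each vertex splits its rank evenly over its outgoing edges.
            incoming[dst] += ranks[src] / out_degree[src]
        # Rank held by vertices without outgoing edges is spread uniformly.
        dangling = sum(ranks[v] for v in vertices if out_degree[v] == 0)
        ranks = {
            v: reset_probability / n
            + (1 - reset_probability) * (incoming[v] + dangling / n)
            for v in vertices
        }
    return ranks


ranks = pagerank(vertices, edges)
print(dict(in_degrees))  # {'b': 2, 'c': 1}
print(follow_count)      # 2
```

Since "a" has no incoming edges, its rank settles at reset_probability / 3, while "b" and "c" share almost all of the remaining rank between them.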

zhibo answered Oct 19 '22 00:10


GraphX in Spark 0.9.0 doesn't have a Python API yet. It's expected in upcoming releases.

Wildfire answered Oct 19 '22 00:10