I'm trying to use the graphframes package in pyspark in Jupyter Notebook (using Sagemaker and sparkmagic) on AWS EMR. I've tried adding a configuration option when creating the EMR cluster in the AWS console:
[{"classification":"spark-defaults", "properties":{"spark.jars.packages":"graphframes:graphframes:0.7.0-spark2.4-s_2.11"}, "configurations":[]}]
But I still got an error when trying to use the graphframes package in my pyspark code in jupyter notebook.
Here's my code (it's from the graphframes example):
# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame
from graphframes import *
g = GraphFrame(v, e)
# Query: Get in-degree of each vertex.
g.inDegrees.show()
# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()
# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()
And here's the output/error:
ImportError: No module named graphframes
I read through this git thread but all the potential work-arounds seem very complicated and require ssh-ing into the master node of the EMR cluster.
You can use AWS Step Functions to run PySpark applications as EMR Steps on an existing EMR cluster. Using Step Functions, we can also create the cluster, run multiple EMR Steps sequentially or in parallel, and finally, auto-terminate the cluster.
You can install Spark on an Amazon EMR cluster along with other Hadoop applications, and it can also leverage the EMR file system (EMRFS) to directly access data in Amazon S3.
You can access the Spark shell by connecting to the master node with SSH and invoking spark-shell . For more information about connecting to the master node, see Connect to the master node using SSH in the Amazon EMR Management Guide. The following examples use Apache HTTP Server access logs stored in Amazon S3.
I finally figured out that there is a PyPi package for graphframes. I used this to create a bootstrapping action as detailed here, although I changed things a little bit.
Here's what I did to get graphframes working on EMR:
#!/bin/bash
sudo pip install graphframes
[{"classification":"spark-defaults","properties":{"spark.jars.packages":"graphframes:graphframes:0.7.0-spark2.4-s_2.11"}}]
# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
("a", "b", "friend"),
("b", "c", "follow"),
("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame
from graphframes import *
g = GraphFrame(v, e)
# Query: Get in-degree of each vertex.
g.inDegrees.show()
# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()
# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()
And this time, finally, I got the correct output:
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+
+---+------------------+
| id| pagerank|
+---+------------------+
| b|1.0905890109440908|
| a| 0.01|
| c|1.8994109890559092|
+---+------------------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With