Add jar to pyspark when using notebook

I'm trying the MongoDB Hadoop integration with Spark, but I can't figure out how to make the jars accessible to an IPython notebook.

Here is what I'm trying to do:

# set up parameters for reading from MongoDB via Hadoop input format
config = {"mongo.input.uri": "mongodb://localhost:27017/db.collection"}
inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"

# these values worked but others might as well
keyClassName = "org.apache.hadoop.io.Text"
valueClassName = "org.apache.hadoop.io.MapWritable"

# Do some reading from mongo
items = sc.newAPIHadoopRDD(inputFormatClassName, keyClassName, valueClassName,
                           keyConverter=None, valueConverter=None, conf=config)
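
As a quick sanity check that the read actually succeeds (assuming db.collection contains documents):

# Fetch a single (key, value) pair back through the connector.
print(items.first())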

This code works fine when I launch it in pyspark using the following command:

spark-1.4.1/bin/pyspark --jars 'mongo-hadoop-core-1.4.0.jar,mongo-java-driver-3.0.2.jar'

where mongo-hadoop-core-1.4.0.jar and mongo-java-driver-3.0.2.jar are the jars that allow using MongoDB from Java. However, when I do this:

IPYTHON_OPTS="notebook" spark-1.4.1/bin/pyspark --jars 'mongo-hadoop-core-1.4.0.jar,mongo-java-driver-3.0.2.jar'

the jars are no longer available, and I get the following error:

java.lang.ClassNotFoundException: com.mongodb.hadoop.MongoInputFormat
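
A quick way to confirm that the class really is missing from the driver's classpath, without going through the RDD machinery (a sketch; sc._jvm is a private PySpark attribute, used here purely as a diagnostic):

# Ask the driver JVM for the class directly via Py4J; if the jar was not
# picked up, this raises the same ClassNotFoundException as above.
sc._jvm.java.lang.Class.forName("com.mongodb.hadoop.MongoInputFormat")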

Does anyone know how to make jars available to Spark from within the IPython notebook? I'm pretty sure this is not specific to Mongo, so perhaps someone has already succeeded in adding jars to the classpath while using the notebook?

asked by zermelozf

1 Answer

This looks very similar; please let me know if it helps: https://issues.apache.org/jira/browse/SPARK-5185
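
In case the link alone is terse: the workaround usually suggested for notebook launches is to hand the jar list to the gateway JVM through the PYSPARK_SUBMIT_ARGS environment variable before any SparkContext exists. A minimal sketch, assuming you create the context yourself in the notebook and the jar paths are relative to the launch directory (the trailing pyspark-shell token is required on recent 1.x releases):

import os

# Sketch: set the submit args before the gateway JVM starts, so the jars
# end up on its classpath; this has no effect once a SparkContext exists.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars mongo-hadoop-core-1.4.0.jar,mongo-java-driver-3.0.2.jar pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext(appName="mongo-hadoop-notebook")

Exporting the same variable in the shell before running IPYTHON_OPTS="notebook" spark-1.4.1/bin/pyspark should have the same effect, since the pre-created sc reads it at startup.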

answered by venuktan