Adding custom jars to PySpark in Jupyter notebook

I am using a Jupyter notebook with PySpark, based on the following Docker image: Jupyter all-spark-notebook

Now I would like to write a PySpark streaming application that consumes messages from Kafka. In the Spark-Kafka Integration guide they describe how to deploy such an application using spark-submit (it requires linking an external jar; the explanation is in 3. Deploying). But since I am using a Jupyter notebook, I never actually run the spark-submit command; I assume it gets run behind the scenes when I press execute.

In the spark-submit command you can specify some parameters, one of them being --jars, but it is not clear to me how I can set this parameter from the notebook (or externally via environment variables?). I am assuming I can link this external jar dynamically via the SparkConf or the SparkContext object. Does anyone have experience with how to perform the linking properly from the notebook?

asked Mar 11 '16 by DDW

People also ask

How do I add jars to PySpark?

Use the --jars option. To add JARs to a Spark job, the --jars option can be used to include JARs on the Spark driver and executor classpaths. If multiple JAR files need to be included, separate them with commas. The following is an example: spark-submit --jars /path/to/jar/file1,/path/to/jar/file2 ...


4 Answers

I've managed to get it working from within the Jupyter notebook which is running from the all-spark container.

I start a Python 3 notebook in JupyterHub and override the PYSPARK_SUBMIT_ARGS variable as shown below. The Kafka consumer library was downloaded from the Maven repository and put in my home directory /home/jovyan:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'

import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

sc = pyspark.SparkContext()
ssc = StreamingContext(sc, 1)

broker = "<my_broker_ip>"
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"],
                        {"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()

Note: don't forget the pyspark-shell at the end of the environment variable!

Extension: if you want to include code from spark-packages, you can use the --packages flag instead. An example of how to do this in the all-spark-notebook can be found here
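
A minimal sketch of the --packages variant, set from within the notebook in the same way as above; the Maven coordinates below are only an assumed example and should match your Spark and Scala versions:

import os

# Assumed coordinates for illustration; use the artifact matching your
# Spark/Scala versions. Spark resolves and downloads the package (and its
# dependencies) itself, so no local jar file is needed.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1 pyspark-shell'
)

import pyspark
sc = pyspark.SparkContext()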

answered by DDW

Indeed, there is a way to link it dynamically via the SparkConf object when you create the SparkSession, as explained in this answer:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("My App") \
    .config("spark.jars", "/path/to/jar.jar,/path/to/another/jar.jar") \
    .getOrCreate()
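
One caveat, as a hedged note rather than part of the original answer: spark.jars is only applied when the underlying SparkContext is first created, so if the notebook has already started a session you may need to stop it before rebuilding. A minimal sketch:

from pyspark.sql import SparkSession

# Minimal sketch: spark.jars only takes effect when a new context is created,
# so stop any session the notebook may already have started.
active = SparkSession.getActiveSession()  # available in recent PySpark versions
if active is not None:
    active.stop()

spark = (
    SparkSession.builder
    .appName("My App")
    .config("spark.jars", "/path/to/jar.jar,/path/to/another/jar.jar")
    .getOrCreate()
)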
answered by Nandan Rao


You can run your Jupyter notebook with the pyspark command by setting the relevant environment variables:

export PYSPARK_DRIVER_PYTHON=jupyter
export IPYTHON=1
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --port=XXX --ip=YYY"

where XXX is the port you want to use to access the notebook and YYY is the IP address.

Now simply run pyspark and add --jars as a switch, the same as you would with spark-submit.

answered by Assaf Mendelson


In case someone is in the same situation as me: I tried all of the above solutions and none of them worked for me. What I'm trying to do is use Delta Lake in a Jupyter notebook.

Finally, I was able to use from delta.tables import * by calling SparkContext.addPyFile("/path/to/your/jar.jar") first. Although the official Spark docs only mention adding .zip or .py files, I tried a .jar and it worked perfectly.
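
A minimal sketch of that workflow, keeping the placeholder jar path from above and assuming the jar bundles the delta Python module:

import pyspark

# Minimal sketch: addPyFile ships the file to the executors and adds it to the
# Python search path; a .jar works because it is a zip archive that Python can
# import from, which is what makes the bundled delta module importable.
sc = pyspark.SparkContext.getOrCreate()
sc.addPyFile("/path/to/your/jar.jar")

from delta.tables import *  # now resolves against the added jar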

answered by Dd__Mad