 

Add Jar to standalone pyspark


I'm launching a pyspark program:

$ export SPARK_HOME=
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
$ python

And the py code:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Example").setMaster("local[2]")
sc = SparkContext(conf=conf)

How do I add jar dependencies such as the Databricks csv jar? Using the command line, I can add the package like this:

$ pyspark --packages com.databricks:spark-csv_2.10:1.3.0
$ spark-submit --packages com.databricks:spark-csv_2.10:1.3.0 foo.py

But I'm not using any of these. The program is part of a larger workflow that does not use spark-submit, so I should be able to run my ./foo.py program and have it just work.

  • I know you can set the Spark extraClassPath properties, but don't you have to copy the JAR files to each node for that?
  • I tried conf.set("spark.jars", "jar1,jar2") (roughly as sketched below), but that didn't work either and failed with a py4j ClassNotFoundException.
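
For reference, this is roughly what that failed attempt looked like; the jar paths are placeholders:

from pyspark import SparkConf, SparkContext

# Sketch of the attempt from the second bullet; /path/to/jar1.jar and
# /path/to/jar2.jar stand in for the real JAR locations.
conf = SparkConf().setAppName("Example").setMaster("local[2]")
conf.set("spark.jars", "/path/to/jar1.jar,/path/to/jar2.jar")
sc = SparkContext(conf=conf)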
asked Mar 03 '16 by Nora Olsen




2 Answers

Updated 2021-01-19

There are many approaches here (setting ENV vars, adding to $SPARK_HOME/conf/spark-defaults.conf, etc.), and other answers already cover these. I wanted to add an answer specifically for those wanting to do this from within a Python script or Jupyter notebook.

When you create the Spark session you can add a .config() that pulls in the specific Jar file (in my case I wanted the Kafka package loaded):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1')\
    .getOrCreate()

Using this line of code I didn't need to do anything else (no ENVs or conf file changes).

  • Note 1: The JAR file is downloaded automatically; you don't need to fetch it manually.
  • Note 2: Make sure the versions match what you want; in the example above my Spark version is 3.0.1, so the coordinate ends with :3.0.1.
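
If you want to sanity-check that the package was picked up, a quick batch read from Kafka works; the broker address and topic below are placeholders for your own setup:

# Sketch: batch-reading from Kafka once spark-sql-kafka is loaded.
# "localhost:9092" and "my_topic" are placeholders.
df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_topic")
      .load())
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(5)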
answered Sep 27 '22 by Brian Wylie


Any dependencies can be passed using the spark.jars.packages property (setting spark.jars should work as well) in $SPARK_HOME/conf/spark-defaults.conf. It should be a comma-separated list of coordinates.
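
For the spark-csv package from the question, the entry would look roughly like this (add further coordinates to the same line, comma-separated):

# $SPARK_HOME/conf/spark-defaults.conf
spark.jars.packages  com.databricks:spark-csv_2.10:1.3.0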

Packages and classpath properties have to be set before the JVM is started, and this happens during SparkConf initialization. It means the SparkConf.set method cannot be used here.

An alternative approach is to set the PYSPARK_SUBMIT_ARGS environment variable before the SparkConf object is initialized:

import os
from pyspark import SparkConf, SparkContext

SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

conf = SparkConf()
sc = SparkContext(conf=conf)
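
With the package loaded, the csv source can then be used as usual; a minimal sketch, where "data.csv" is a placeholder path:

from pyspark.sql import SQLContext

# Sketch: reading a CSV via the spark-csv package loaded above.
sqlContext = SQLContext(sc)
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data.csv"))
df.show(5)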
answered Sep 27 '22 by zero323