I'm launching a pyspark program:
$ export SPARK_HOME=
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
$ python
And the py code:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Example").setMaster("local[2]")
sc = SparkContext(conf=conf)
How do I add jar dependencies such as the Databricks csv jar? Using the command line, I can add the package like this:
$ pyspark --packages com.databricks:spark-csv_2.10:1.3.0
$ spark-submit --packages com.databricks:spark-csv_2.10:1.3.0
But I'm not using either of these. The program is part of a larger workflow that does not use spark-submit; I should be able to run my ./foo.py program and it should just work.
Updated 2021-01-19
There are many approaches here (setting ENV vars, adding to $SPARK_HOME/conf/spark-defaults.conf, etc.), and other answers already cover these. I wanted to add an answer for those specifically wanting to do this from within a Python script or Jupyter notebook.
When you create the Spark session you can add a .config() that pulls in the specific Jar file (in my case I wanted the Kafka package loaded):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_awesome') \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1') \
    .getOrCreate()
Using this line of code I didn't need to do anything else (no ENVs or conf file changes).
Just adjust the version suffix (the :3.0.1 at the end) to match the Spark version you are running.
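For the Databricks CSV package the question asks about, the same pattern would look roughly like this (a sketch; the coordinate com.databricks:spark-csv_2.10:1.3.0 is taken from the question, so pick the artifact matching your Scala and Spark versions):

from pyspark.sql import SparkSession

# Sketch: pull the spark-csv package in at session creation time,
# exactly as with the Kafka package above.
spark = (SparkSession.builder
         .appName('my_awesome')
         .config('spark.jars.packages', 'com.databricks:spark-csv_2.10:1.3.0')
         .getOrCreate())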
Any dependencies can be passed using the spark.jars.packages property (setting spark.jars should work as well) in $SPARK_HOME/conf/spark-defaults.conf. It should be a comma separated list of coordinates.
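For example, a spark-defaults.conf entry might look like this (a sketch using the two coordinates that appear elsewhere in this thread; separate multiple packages with commas):

spark.jars.packages  com.databricks:spark-csv_2.11:1.2.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1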
Packages and classpath related properties have to be set before the JVM is started, and this happens during SparkConf initialization. That means the SparkConf.set method cannot be used here.
An alternative approach is to set the PYSPARK_SUBMIT_ARGS environment variable before the SparkConf object is initialized:
import os
from pyspark import SparkConf, SparkContext

SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

conf = SparkConf()
sc = SparkContext(conf=conf)
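To confirm the package actually loaded, you could then read a CSV file through the spark-csv data source (a sketch; the path example.csv is hypothetical, and this uses the Spark 1.x SQLContext API that matches the spark-csv 1.2.0 package above):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Hypothetical file path; any CSV with a header row will do.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("example.csv"))
df.show()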