 

Add Jar to standalone pyspark


I'm launching a pyspark program:

$ export SPARK_HOME=
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
$ python

And the py code:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Example").setMaster("local[2]")
sc = SparkContext(conf=conf)

How do I add jar dependencies such as the Databricks csv jar? Using the command line, I can add the package like this:

$ pyspark --packages com.databricks:spark-csv_2.10:1.3.0
$ spark-submit --packages com.databricks:spark-csv_2.10:1.3.0 foo.py

But I'm not using any of these. The program is part of a larger workflow that does not use spark-submit, so I should be able to run my ./foo.py program and have it just work.

  • I know you can set the Spark extraClassPath properties, but don't you have to copy the JAR files to each node for that?
  • I tried conf.set("spark.jars", "jar1,jar2") (roughly as sketched below), but that didn't work either and failed with a py4j ClassNotFoundException.
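
For reference, this is roughly what that failed attempt looked like; the jar paths are placeholders:

from pyspark import SparkConf, SparkContext

# Sketch of the attempt from the second bullet; /path/to/jar1.jar and
# /path/to/jar2.jar stand in for the real JAR locations.
conf = SparkConf().setAppName("Example").setMaster("local[2]")
conf.set("spark.jars", "/path/to/jar1.jar,/path/to/jar2.jar")
sc = SparkContext(conf=conf)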
asked Mar 03 '16 by Nora Olsen




2 Answers

Updated 2021-01-19

There are many approaches here (setting ENV vars, adding to $SPARK_HOME/conf/spark-defaults.conf, etc.), and other answers already cover these. I wanted to add an answer specifically for those wanting to do this from within a Python script or Jupyter notebook.

When you create the Spark session you can add a .config() that pulls in the specific Jar file (in my case I wanted the Kafka package loaded):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1')\
    .getOrCreate()

Using this line of code I didn't need to do anything else (no ENVs or conf file changes).

  • Note 1: The JAR file is downloaded automatically; you don't need to fetch it manually.
  • Note 2: Make sure the versions match what you want; in the example above my Spark version is 3.0.1, so the coordinate ends with :3.0.1.
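
If you want to sanity-check that the package was picked up, a quick batch read from Kafka works; the broker address and topic below are placeholders for your own setup:

# Sketch: batch-reading from Kafka once spark-sql-kafka is loaded.
# "localhost:9092" and "my_topic" are placeholders.
df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_topic")
      .load())
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(5)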
answered Sep 27 '22 by Brian Wylie


Any dependencies can be passed using the spark.jars.packages property (setting spark.jars should work as well) in $SPARK_HOME/conf/spark-defaults.conf. It should be a comma-separated list of coordinates.
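
For the spark-csv package from the question, the entry would look roughly like this (add further coordinates to the same line, comma-separated):

# $SPARK_HOME/conf/spark-defaults.conf
spark.jars.packages  com.databricks:spark-csv_2.10:1.3.0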

Packages and classpath properties have to be set before the JVM is started, and this happens during SparkConf initialization. It means the SparkConf.set method cannot be used here.

An alternative approach is to set the PYSPARK_SUBMIT_ARGS environment variable before the SparkConf object is initialized:

import os
from pyspark import SparkConf, SparkContext

SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

conf = SparkConf()
sc = SparkContext(conf=conf)
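
With the package loaded, the csv source can then be used as usual; a minimal sketch, where "data.csv" is a placeholder path:

from pyspark.sql import SQLContext

# Sketch: reading a CSV via the spark-csv package loaded above.
sqlContext = SQLContext(sc)
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data.csv"))
df.show(5)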
answered Sep 27 '22 by zero323