I'm trying to automatically add jars to my PySpark classpath. Right now I can type the following command and it works:
$ pyspark --jars /path/to/my.jar
I'd like to have that jar included by default so that I only need to type pyspark, and so that I can also use it in IPython Notebook.
I've read that I can include the argument by setting PYSPARK_SUBMIT_ARGS in my environment:
export PYSPARK_SUBMIT_ARGS="--jars /path/to/my.jar"
Unfortunately the above doesn't work. I get the runtime error Failed to load class for data source.
I'm running Spark 1.3.1.
Edit
My workaround when using IPython Notebook is the following:
$ IPYTHON_OPTS="notebook" pyspark --jars /path/to/my.jar
It's better to pass the driver and executor class paths with --conf, which adds them to the Spark configuration itself, so the paths are reflected in the running application's config. Just make sure the JAR files sit at the same path on every node in the cluster.
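A sketch of what that invocation could look like (the JAR path is just the placeholder from above):
$ pyspark --conf "spark.driver.extraClassPath=/path/to/my.jar" \
          --conf "spark.executor.extraClassPath=/path/to/my.jar"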
Spark JAR files let you package a project into a single file so it can be run on a Spark cluster. A lot of developers write Spark code in browser-based notebooks because they're unfamiliar with JAR files.
spark-packages.org is an external, community-managed list of third-party libraries, add-ons, and applications that work with Apache Spark. You can add a package as long as you have a GitHub repository.
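If the library you need is published as a Spark package (or to Maven Central), you can also pull it in by coordinate with --packages instead of shipping the JAR yourself; the coordinate below is only illustrative:
$ pyspark --packages com.databricks:spark-csv_2.10:1.0.3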
You can add the JAR files in the spark-defaults.conf file (located in the conf folder of your Spark installation). If there is more than one JAR in the list, use : as the separator.
spark.driver.extraClassPath /path/to/my.jar
This property is documented in https://spark.apache.org/docs/1.3.1/configuration.html#runtime-environment
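As a sketch, entries for two JARs might look like this (paths are placeholders; the executor class path can be set the same way):
# in $SPARK_HOME/conf/spark-defaults.conf -- example paths only
spark.driver.extraClassPath   /path/to/my.jar:/path/to/other.jar
spark.executor.extraClassPath /path/to/my.jar:/path/to/other.jar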