
How do we specify maven dependencies in pyspark

While starting spark-submit / pyspark, we have the option of specifying the jar files using the --jars option. How can we specify Maven dependencies in pyspark? Do we have to pass all the jars every time we run a pyspark application, or is there a cleaner way?

Neeleshkumar S asked Mar 23 '17

2 Answers

Another way I find very practical for testing and development is to create the SparkSession within the script itself, adding a config option that passes the Maven package dependencies through spark.jars.packages, like this:

from pyspark.sql import SparkSession

# spark.jars.packages takes Maven coordinates (groupId:artifactId:version);
# Spark resolves and downloads them when the session is created
spark = SparkSession.builder.master("local[*]")\
        .config('spark.jars.packages', 'groupId:artifactId:version')\
        .getOrCreate()

This will automatically download the specified dependencies from the Maven repository, so double-check your internet connection. To pull in more than one package, specify the coordinates as a comma-separated list.
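For instance, here is a minimal sketch of resolving two packages at once; the coordinates below are placeholders, not real artifacts:

from pyspark.sql import SparkSession

# Placeholder coordinates: replace with the real groupId:artifactId:version values you need
packages = ",".join([
    "com.example:first-artifact:1.0.0",
    "com.example:second-artifact:2.3.1",
])

spark = SparkSession.builder.master("local[*]")\
        .config('spark.jars.packages', packages)\
        .getOrCreate()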

In the same way, any other Spark configuration parameter listed in the Spark documentation can be passed to the SparkSession.

For the full list of Maven packages please refer to https://mvnrepository.com/

Vzzarr answered Sep 17 '22


According to https://spark.apache.org/docs/latest/submitting-applications.html, there is a --packages option that takes a comma-delimited list of Maven coordinates:

./bin/spark-submit --packages my:awesome:package
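
As a sketch, assuming a hypothetical application script app.py and placeholder coordinates, a submission pulling several packages at once would look like this:

# Placeholder coordinates and script name; multiple packages are comma-delimited
./bin/spark-submit \
  --packages com.example:first-artifact:1.0.0,com.example:second-artifact:2.3.1 \
  app.py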
Martin Kretz answered Sep 17 '22