When starting spark-submit / pyspark, we have the option of specifying the jar files using the --jars option. How can we specify Maven dependencies in pyspark? Do we have to pass all the jars every time we run a pyspark application, or is there a cleaner way?
Another way I find very practical for testing/development is to create the SparkSession within the script itself, adding a config option that passes the Maven package dependencies through spark.jars.packages, like this:
from pyspark.sql import SparkSession

# 'groupId:artifactId:version' is the Maven coordinate of the package to download
spark = SparkSession.builder.master("local[*]")\
    .config('spark.jars.packages', 'groupId:artifactId:version')\
    .getOrCreate()
This will automatically download the specified dependencies from the Maven repository, so double-check your internet connection. To pull in more than one package, list the coordinates in a comma-separated fashion.
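As a minimal sketch of the multi-package case (the coordinates below are placeholders; substitute the real groupId:artifactId:version of the packages you need):

from pyspark.sql import SparkSession

# Each entry is a Maven coordinate in groupId:artifactId:version form.
# These coordinates are placeholders; replace them with real packages.
packages = ",".join([
    "org.example:first-dependency:1.0.0",
    "org.example:second-dependency:2.3.1",
])

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.jars.packages", packages)
         .getOrCreate())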
In the same way, any other Spark parameter listed here can be passed to the SparkSession.
For the full list of available Maven packages, please refer to https://mvnrepository.com/
According to https://spark.apache.org/docs/latest/submitting-applications.html, there is a --packages option that accepts a comma-delimited list of Maven coordinates.
./bin/spark-submit --packages my:awesome:package
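For example, with several dependencies (the coordinates and the script name below are placeholders), the list is just comma-delimited:

./bin/spark-submit --packages org.example:first-dependency:1.0.0,org.example:second-dependency:2.3.1 my_app.py

The same flag also works when launching the interactive shell, e.g. ./bin/pyspark --packages org.example:first-dependency:1.0.0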