When starting spark-submit / pyspark, we have the option of specifying the jar files using the --jars option. How can we specify Maven dependencies in pyspark? Do we have to pass all the jars every time we run a pyspark application, or is there a cleaner way?
Another way I find very practical for testing/development is to create the SparkSession within the script itself, adding a config option that passes the Maven package dependencies through spark.jars.packages, like this:
from pyspark.sql import SparkSession

# 'groupId:artifactId:version' is the Maven coordinate of the package to download
spark = SparkSession.builder.master("local[*]")\
    .config('spark.jars.packages', 'groupId:artifactId:version')\
    .getOrCreate()
This will automatically download the specified dependencies from the Maven repository, so double-check your internet connection. To pull in more than one package, list the coordinates in a comma-separated fashion.
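As a minimal sketch of the multi-package case (the coordinates below are placeholders; substitute the real groupId:artifactId:version of the packages you need):

from pyspark.sql import SparkSession

# Each entry is a Maven coordinate in groupId:artifactId:version form.
# These coordinates are placeholders; replace them with real packages.
packages = ",".join([
    "org.example:first-dependency:1.0.0",
    "org.example:second-dependency:2.3.1",
])

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.jars.packages", packages)
         .getOrCreate())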
In the same way, any other Spark parameter listed here can be passed to the SparkSession.
For the full list of available Maven packages, please refer to https://mvnrepository.com/
According to https://spark.apache.org/docs/latest/submitting-applications.html, there is a --packages option that accepts a comma-delimited list of Maven coordinates.
./bin/spark-submit --packages my:awesome:package
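For example, with several dependencies (the coordinates and the script name below are placeholders), the list is just comma-delimited:

./bin/spark-submit --packages org.example:first-dependency:1.0.0,org.example:second-dependency:2.3.1 my_app.py

The same flag also works when launching the interactive shell, e.g. ./bin/pyspark --packages org.example:first-dependency:1.0.0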