 

spark-submit - Cannot import packages from environment submitted as --archives

I have a conda virtual environment. I packed it with conda-pack and then ran spark-submit, passing the archive via the --archives argument.

But from the submitted script I am unable to import the packages available in the packed environment (pyspark_venv.tar.gz); I get a ModuleNotFoundError.

I am using an EMR cluster.

My spark-submit command looks like this:

spark-submit --archives pyspark_venv.tar.gz#environment app.py

It is from within app.py that I am unable to import the packages.
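
A minimal sketch of the kind of script that hits this (pandas here is just a stand-in for any package bundled in pyspark_venv.tar.gz, not my actual code):

# app.py - minimal sketch; pandas stands in for any bundled dependency
from pyspark.sql import SparkSession
import pandas as pd  # raises ModuleNotFoundError unless the packed interpreter is used

spark = SparkSession.builder.appName("venv-test").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
print(df.toPandas())  # toPandas() needs pandas on the driver too
spark.stop()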

asked Jan 27 '26 by Tom J Muthirenthi


1 Answer

According to the Spark documentation, you should set the PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON environment variables before calling spark-submit, so that both the driver and the executors use the Python interpreter from the unpacked archive:

conda pack -f -o pyspark_conda_env.tar.gz

export PYSPARK_DRIVER_PYTHON=python  # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python  # "environment" is the unpack dir named by the #environment suffix below.
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
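
If imports still fail, a quick debugging sketch (my own addition, not from the documentation) is to print which interpreter the driver and executors are actually running; the executor paths should point inside ./environment/bin:

# check_env.py - debugging sketch: report driver and executor interpreters
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("env-check").getOrCreate()
executor_pythons = (
    spark.sparkContext
    .parallelize(range(2), 2)
    .map(lambda _: __import__("sys").executable)  # runs on the executors
    .distinct()
    .collect()
)
print("driver python:  ", sys.executable)
print("executor python:", executor_pythons)
spark.stop()

Also note that in YARN cluster mode (the usual EMR setup) the driver runs in the application master, so instead of exporting PYSPARK_DRIVER_PYTHON you would, to my knowledge, pass --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python; check the Spark/EMR docs for your release.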
answered Jan 29 '26 by Gabriel M. Silva


