I have a conda virtual environment that I packed with conda-pack and passed to spark-submit as an --archives argument.
But from within the Spark application I am unable to import the packages available in the packed environment (pyspark_venv.tar.gz); the imports fail with a ModuleNotFoundError.
I am using an EMR cluster.
My spark-submit command looks like this:
spark-submit --archives pyspark_venv.tar.gz#environment app.py
It is from within app.py that I am unable to import the packages.
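For illustration, a minimal app.py of this shape reproduces the error (pandas here is just a stand-in for any package that exists only inside the packed conda environment):

# app.py - minimal sketch; pandas stands in for any package that lives
# only inside the packed conda environment.
from pyspark.sql import SparkSession

import pandas as pd  # raises ModuleNotFoundError if the packed env is not picked up

spark = SparkSession.builder.appName("conda-env-test").getOrCreate()
print(pd.__version__)            # proves the third-party import resolved
print(spark.range(3).count())    # proves Spark itself works
spark.stop()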
According to the Spark documentation, you should set the PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON environment variables before calling spark-submit:
conda pack -f -o pyspark_conda_env.tar.gz
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
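Note that exported environment variables only reach the driver in client mode. If you submit on EMR in YARN cluster mode, the driver runs in the application master, so a common approach is to point PYSPARK_PYTHON at the unpacked archive through Spark configuration instead. A sketch, assuming YARN cluster mode and the same archive alias as above:

spark-submit \
  --deploy-mode cluster \
  --archives pyspark_conda_env.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  app.py

The ./environment/bin/python path works because the archive is extracted into each container's working directory under the alias given after the # in the --archives value.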