I have the following structure in a Google Cloud Storage (GCS) bucket:
gs://my_bucket/py_scripts/
wrapper.py
mymodule.py
__init__.py
I am running wrapper.py through Dataproc as a PySpark job. It imports mymodule with import mymodule at the start, but the job fails with an error saying there is no module named mymodule, even though both files are at the same path. This works fine in a Unix environment.
Note that __init__.py is empty. I also tested from mymodule import myfunc, but it returns the same error.
Can you provide your PySpark job submit command? I suspect you are not passing the "--py-files" parameter to supply the additional Python files to the job. See https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/pyspark for reference. Dataproc will not automatically pick up other files in the same GCS bucket as inputs to the job.
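For example, a submit command along these lines should make mymodule importable from wrapper.py (the cluster name and region below are placeholders, substitute your own):

    gcloud dataproc jobs submit pyspark gs://my_bucket/py_scripts/wrapper.py \
        --cluster=your-cluster-name \
        --region=your-region \
        --py-files=gs://my_bucket/py_scripts/mymodule.py

--py-files takes a comma-separated list of .py, .zip, or .egg files, so you can pass several modules (or a zipped package) the same way.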