For example, I have a folder:
/
- test.py
- test.yml
and the job is submitted to the Spark cluster with:
gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py"
In test.py, I want to access the static file I uploaded:
with open('test.yml') as test_file:
    logging.info(test_file.read())
but I got the following exception:
IOError: [Errno 2] No such file or directory: 'test.yml'
How do I access the file I uploaded?
Currently, as Dataproc is not in beta anymore, in order to directly access a file in Cloud Storage from the PySpark code, submitting the job with the --files parameter will do the work. SparkFiles is not required. For example:
gcloud dataproc jobs submit pyspark \
--cluster *cluster name* --region *region name* \
--files gs://<BUCKET NAME>/<FILE NAME> gs://<BUCKET NAME>/filename.py
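With that, a minimal sketch of what filename.py might contain, assuming the file shipped with --files (test.yml from the question) is placed in the job's working directory as described above:

import logging

logging.basicConfig(level=logging.INFO)

# The distributed file can be opened by its bare name from the working directory.
with open('test.yml') as test_file:
    logging.info(test_file.read())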
When reading input from GCS via the Spark API, it works with the GCS connector.
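For example, a sketch of reading an object straight from GCS through the Spark DataFrame API (the bucket and object names are placeholders, as above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('gcs-read-example').getOrCreate()

# The GCS connector on Dataproc resolves gs:// paths directly.
df = spark.read.text('gs://<BUCKET NAME>/<FILE NAME>')
df.show(5, truncate=False)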
Files distributed using SparkContext.addFile (and --files) can be accessed via SparkFiles. It provides two methods:
- getRootDirectory() - returns the root directory for distributed files
- get(filename) - returns the absolute path to the file
I am not sure if there are any Dataproc-specific limitations, but something like this should work just fine:
import logging
from pyspark import SparkFiles

with open(SparkFiles.get('test.yml')) as test_file:
    logging.info(test_file.read())
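The same lookup should also work inside executor code, since SparkFiles resolves the node-local copy on whichever machine the task runs; a small sketch (the RDD and the length calculation are only for illustration):

import os
from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()

# Root directory holding all files added via --files / SparkContext.addFile.
print(os.listdir(SparkFiles.getRootDirectory()))

def read_distributed_file(_):
    # get() returns the absolute path of the node-local copy on executors too.
    with open(SparkFiles.get('test.yml')) as f:
        return [len(f.read())]

print(sc.parallelize(range(2), 2).mapPartitions(read_distributed_file).collect())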