 

When submitting a job with PySpark, how do I access static files uploaded with the --files argument?

For example, I have a folder:

/
  - test.py
  - test.yml

and the job is submitted to the Spark cluster with:

gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py"

In test.py, I want to access the static file I uploaded:

import logging

with open('test.yml') as test_file:
    logging.info(test_file.read())

but I got the following exception:

IOError: [Errno 2] No such file or directory: 'test.yml'

How can I access the file I uploaded?

asked Jan 22 '16 by lucemia

2 Answers

Now that Dataproc is no longer in beta, to directly access a file in Cloud Storage from PySpark code, submitting the job with the --files parameter is enough; SparkFiles is not required. For example:

gcloud dataproc jobs submit pyspark \
  --cluster <CLUSTER NAME> --region <REGION NAME> \
  --files gs://<BUCKET NAME>/<FILE NAME> gs://<BUCKET NAME>/filename.py
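The staged file can then be opened by its base name in the driver script, since --files copies it into the job's working directory. A minimal sketch of what filename.py might do, assuming the staged file is named test.yml as in the question:

import logging

# --files stages test.yml into the job's working directory,
# so the driver can open it by its base name.
with open('test.yml') as test_file:
    logging.info(test_file.read())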

When reading input from Cloud Storage through the Spark API, the GCS connector (installed by default on Dataproc clusters) resolves gs:// paths directly.
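For instance, a minimal sketch of reading a file from Cloud Storage through the Spark API instead of the local filesystem (the bucket and file names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The GCS connector resolves gs:// URIs, so objects in Cloud Storage
# can be read directly, here as a DataFrame of text lines.
df = spark.read.text("gs://<BUCKET NAME>/<FILE NAME>")
df.show()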

answered Sep 22 '22 by BabyPanda


Files distributed using SparkContext.addFile (and --files) can be accessed via SparkFiles. It provides two methods:

  • getRootDirectory() - returns the root directory for distributed files
  • get(filename) - returns the absolute path to the file

I am not sure if there are any Dataproc-specific limitations, but something like this should work just fine:

import logging

from pyspark import SparkFiles

with open(SparkFiles.get('test.yml')) as test_file:
    logging.info(test_file.read())
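The same lookup works on executors, which matters when each task needs the file. A hedged sketch, assuming a SparkContext named sc and the same test.yml distributed via --files or addFile:

from pyspark import SparkFiles

# Root directory where distributed files land on this node
print(SparkFiles.getRootDirectory())

# SparkFiles.get resolves the correct local path on each executor,
# so tasks can read the distributed file too.
def read_first_line(_):
    with open(SparkFiles.get('test.yml')) as f:
        return [f.readline().strip()]

print(sc.parallelize([0], 1).mapPartitions(read_first_line).collect())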
answered Sep 22 '22 by zero323