I have my files placed on Dataproc storage as:
I am currently working in the inotebook.ipynb file and need to use the functions in test1.py and test2.py. Locally, I can run !python ....py and then use the available functions (or create a package and install it). Are any of these options available in a Google Cloud Dataproc notebook?
I tried the suggestions from the links below, and none of them worked:
Is there any way to install a custom package, or to somehow run .py files from the same sub-directory as my notebook file, on Dataproc?
Unfortunately, using custom packages stored in GCS is still a limitation of Dataproc. I was able to make the mentioned workaround work with a few changes: I defined a prefix so that the client points at the correct files under the directory, then looped through the returned blobs to download the files to the local Dataproc cluster before executing the succeeding lines of code. See the code below:
GCS bucket structure:
my-bucket
└───notebooks
    └───jupyter
        |   gcs_test.ipynb
        └───dependencies
            ├───hi_gcs.py
            └───hello_gcs.py
hi_gcs.py:
def say_hi(name):
    return "Hi {}!".format(name)
hello_gcs.py:
def say_hello(name):
    return "Hello {}!".format(name)
gcs_test.ipynb:
from google.cloud import storage

def get_module():
    client = storage.Client()
    bucket = client.get_bucket('my-bucket')
    # define the path to your python files at the prefix
    blobs = list(client.list_blobs(bucket, prefix='notebooks/jupyter/dependencies/'))
    for blob in blobs[1:]:  # skip the 1st element since it is the top directory
        name = blob.name.split('/')[-1]  # get the filename only
        blob.download_to_filename(name)  # download the python files to the local dataproc cluster

def use_my_module(val):
    get_module()
    import hi_gcs
    import hello_gcs
    print(hello_gcs.say_hello(val))
    print(hi_gcs.say_hi(val))

use_my_module('User 1')
Output:
Hello User 1!
Hi User 1!
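A slightly more defensive variant of get_module, offered as an untested sketch (the name get_module_safe is mine): rather than skipping the first blob, which assumes a zero-byte placeholder object exists for the directory, it keeps only .py objects and makes sure the download directory is on sys.path before the imports run:

import os
import sys
from google.cloud import storage

def get_module_safe(bucket_name='my-bucket', prefix='notebooks/jupyter/dependencies/'):
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    for blob in client.list_blobs(bucket, prefix=prefix):
        if not blob.name.endswith('.py'):
            continue  # skip the directory placeholder and any non-Python objects
        name = blob.name.split('/')[-1]  # filename only
        blob.download_to_filename(name)
    # Notebook kernels usually have the working directory on sys.path already,
    # but adding it explicitly makes the subsequent imports reliable.
    if os.getcwd() not in sys.path:
        sys.path.insert(0, os.getcwd())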