 

How to import python file as module in Jupyter notebook?

I am developing AWS Glue scripts and I am trying to use the Dev Endpoint. I followed the wizard to create a Dev Endpoint and a SageMaker notebook attached to it. When I open the SageMaker notebook, it directs me to a web page called Jupyter.

In Jupyter, I created several notebooks alongside my Python files. The problem is that the shared Python files cannot be imported into the notebooks as modules. I get the following error:

Traceback (most recent call last):
  ...
ImportError: No module named shared.helper

Here is my project structure on the Jupyter notebook:

my_project/
│
├── scripts/
│   ├── a_notebook.ipynb
│   ├── b_notebook.ipynb
│   ├── c_notebook.ipynb
│   ├── __init__.py
│   └── shared/
│       ├── helper.py
│       ├── config.py
│       └── __init__.py

I have tried many suggestions that I found on the Internet, but none of them worked.

In a_notebook.ipynb, I just use import shared.helper as helper, and it shows me the above error.

I don't know whether this is related to AWS Glue, since I am opening Jupyter from the SageMaker notebook under the AWS Glue console.

asked Apr 16 '19 by Bill Li

People also ask

Can you import py file in Jupyter Notebook?

A module is simply a text file named with a .py suffix, whose contents consist of Python code. A module can be imported into an interactive console environment (e.g. a Jupyter notebook) or into another module.
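For example, a minimal module could be a single file (the name and contents below are just illustrative):

# helper.py
def greet(name):
    return f"Hello, {name}!"

which you could then use from a notebook or another module in the same directory:

import helper
helper.greet("world")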

How do I import a module into Jupyter?

No more typing "import pandas as pd" 10 times a day: create a folder called startup inside your IPython profile directory if it's not already there, add a new Python file called start.py, and put your favorite imports in that file. Launch IPython or a Jupyter Notebook and your favorite libraries will be loaded automatically every time!
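A minimal sketch of such a startup file, assuming the default profile location (~/.ipython/profile_default/startup/) and assuming you want numpy and pandas preloaded:

# ~/.ipython/profile_default/startup/start.py
# Everything in this file runs automatically each time IPython or a Jupyter kernel starts,
# so the imports below are always available.
import numpy as np
import pandas as pd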

How do I import a .py file into a module?

If you have your own Python files you want to import, you can use the import statement as follows:

>>> import my_file  # assuming the file my_file.py is in the current directory

For files in other directories, provide the path to that file, absolute or relative.


2 Answers

TLDR

According to the docs

You need to upload your Python files to an S3 bucket. If you have more than one file, you need to zip them. When you create the dev endpoint, there is a setting Python library path under Security configuration, script libraries, and job parameters (optional) for setting the S3 path to your custom libraries (scripts, modules, packages). You'll also need to make sure the IAM policy attached to the IAM role used by the dev endpoint allows listing and reading objects (s3:ListBucket, s3:GetObject, etc.) on that bucket.
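A minimal sketch of that packaging step, assuming you zip the shared/ package; the bucket and key names below are placeholders:

# Package the shared/ directory as a zip and upload it to S3, so the zip can be
# set as the dev endpoint's "Python library path" (bucket and key are placeholders).
import zipfile
import boto3

with zipfile.ZipFile("shared.zip", "w") as zf:
    for name in ("__init__.py", "helper.py", "config.py"):
        zf.write(f"shared/{name}")  # keep the package directory inside the archive

boto3.client("s3").upload_file("shared.zip", "my-glue-dev-bucket", "libs/shared.zip")
# Python library path: s3://my-glue-dev-bucket/libs/shared.zip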

Details

It's a bit of extra work but the main reason is that the libraries need to be loaded to each and every DPU (execution container) in the Spark cluster.

When you use the Sparkmagic (pyspark) kernel, it uses Apache Livy to connect to and run your code on a remote Spark cluster. The dev endpoint is effectively a Spark cluster, and your "Sagemaker notebook"^ connects to the Livy host on that cluster.

This is quite different from a normal Python environment, mainly because the present-working-directory and where the code gets executed are not the same place. Sagemaker allows use of a lot of the Jupyter magics, so you can test this out and see.

For example in a paragraph run this

%pwd

It will show you what you expected to see, something like

/home/ec2-user/SageMaker

And try this:

%ls

And you'll see something like this

Glue Examples/ lost+found/ shared/ a_notebook.ipynb

Those magics are using the Notebook's context and showing you directories relative to it. If you try this:

import os
print(os.getcwd())

You'll see something quite different:

/mnt/yarn/usercache/livy/appcache/application_1564744666624_0002/

That's a Spark (hadoop HDFS really) directory from the driver container on the cluster. Hadoop directories are distributed with redundancy so it's not necessarily correct to say that the directory is in that container, nor is that really important. The point is that the directory is on the remote cluster, not on the ec2 instance running your notebook.

Sometimes a nice trick to load modules is to modify your sys.path to include a directory you want to import modules from. Unfortunately that doesn't work here: if you appended /home/ec2-user/SageMaker to the path, firstly that path doesn't exist on the cluster, and secondly the pyspark context can't search a path on your notebook's EC2 host.
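To make that concrete, a hypothetical illustration of what happens if you try it from the Sparkmagic (pyspark) kernel:

import sys
# This path only exists on the notebook's EC2 instance, but the code below
# runs on the remote Spark driver, so the import still fails.
sys.path.append('/home/ec2-user/SageMaker')
import shared.helper  # still raises ImportError: No module named shared.helper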

Another thing you can do to prove this is all true is to change your kernel in the running notebook. There's a kernel menu option for that in Jupyter. I suggest conda_python3.

Of course, this kernel will not be connected to the Spark cluster so no Spark code will work, but you can again try the above tests for %pwd, and print(os.getcwd()) and see that they now show the same local directory. You should also be able to import your module, although you may need to modify the path, e.g.

import os
import sys
shared_path = '/home/ec2-user/SageMaker/shared'
if shared_path not in sys.path:
    sys.path.append(shared_path)

You should then be able to run this

import helper

But at this point, you're not in the Sparkmagic (pyspark) kernel, so that's no good to you.

It's a long explanation, but it should help make clear why the annoying requirement to upload scripts to an S3 bucket exists. When your dev endpoint launches, it has a hook to load your custom libraries from that location so they are available to the Spark cluster containers.
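Once the dev endpoint has been created (or recreated) with the Python library path set, a quick check from the Sparkmagic (pyspark) kernel could look like this (a sketch, assuming the shared package was uploaded as described above):

import shared.helper as helper
print(helper.__file__)  # should point at a path on the cluster, not on the notebook host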

^ Note that Sagemaker is the AWS re-branding of Jupyter notebooks, which is a little confusing. Sagemaker is also the name of a service in AWS for automated machine learning model training / testing / deployment lifecycle management. It's essentially Jupyter notebooks plus some scheduling plus some API endpoints on top. I'd be surprised if it weren't something like papermill running under the hood.

answered Oct 03 '22 by Davos


You can import modules into Spark using:

spark.sparkContext.addPyFile("<hdfs_path>/foo.py")

Then just import it like:

import foo
from foo import bar

HDFS path examples:

Azure: "abfs://<container>@<storage_account>.dfs.core.windows.net/foo/bar.py"
AWS: "s3://<bucket>/foo/bar.py"
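The same approach should work for a whole package if you distribute it as a zip (a sketch with placeholder paths, assuming shared/ was zipped along with its __init__.py):

# Ship a zipped package to the driver and executors so "import shared.helper" resolves
spark.sparkContext.addPyFile("s3://my-bucket/libs/shared.zip")

import shared.helper as helper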
answered Oct 03 '22 by utkarshgupta137