Best practices for turning Jupyter notebooks into Python scripts

Tags:

python

jupyter

refactoring

ipython-notebook

readability

Jupyter (IPython) Notebook is deservedly known as a good tool for prototyping code and doing all kinds of machine learning work interactively. But when I use it, I inevitably run into the following:

- the notebook quickly becomes too complex and messy to be maintained and improved further as a notebook, and I have to make Python scripts out of it;
- when it comes to production code (e.g. code that needs to be re-run every day), the notebook again is not the best format.

Suppose I've developed a whole machine learning pipeline in Jupyter that includes fetching raw data from various sources, cleaning the data, feature engineering, and finally training models. Now what's the best way to turn it into scripts with efficient and readable code? I have tried several approaches so far:

1. Simply convert .ipynb to .py and, with only slight changes, hard-code the whole pipeline from the notebook into one Python script.
   - '+': quick
   - '-': dirty, inflexible, inconvenient to maintain
2. Make a single script with many functions (roughly one function per one or two cells), trying to capture the stages of the pipeline as separate functions, and name them accordingly. Then specify all parameters and global constants via argparse.
   - '+': more flexible usage; more readable code (if you properly transformed the pipeline logic into functions)
   - '-': oftentimes the pipeline does NOT split into logically complete pieces that could become functions without quirks in the code. These functions typically only need to be called once in the script rather than many times inside loops, maps, etc. Furthermore, each function typically takes the output of all functions called before it, so one has to pass many arguments to each function.
3. The same thing as point (2), but now wrap all the functions inside a class. All the global constants, as well as the outputs of each method, can then be stored as class attributes.
   - '+': you needn't pass many arguments to each method, since all the previous outputs are already stored as attributes
   - '-': the overall logic of the task is still not captured: it is a data and machine learning pipeline, not just a class. The only purpose of the class is to be created, call all the methods sequentially one by one, and then be removed. On top of this, classes take quite long to implement.
4. Convert the notebook into a Python module with several scripts. I haven't tried this, but I suspect it is the most time-consuming way to deal with the problem.
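To make approach (2) concrete, here is a minimal sketch. Everything in it is hypothetical: the stage names (`fetch`, `clean`, `add_features`, `train`), the toy in-memory rows standing in for real data, and the `--min-value` flag. It only illustrates the shape of such a script, including how each stage's output has to be passed along explicitly.

```python
import argparse


def fetch():
    """Stand-in for fetching raw data (hypothetical toy rows)."""
    return [
        {"x": 1.0, "y": 2.1},
        {"x": 2.0, "y": 3.9},
        {"x": None, "y": 5.0},
        {"x": 3.0, "y": 6.2},
    ]


def clean(rows, min_value):
    """Drop rows whose x is missing or below min_value."""
    return [r for r in rows if r["x"] is not None and r["x"] >= min_value]


def add_features(rows):
    """Add a squared feature to each row."""
    return [{**r, "x_sq": r["x"] ** 2} for r in rows]


def train(rows):
    """'Train' a trivial model: slope of y on x through the origin."""
    numerator = sum(r["x"] * r["y"] for r in rows)
    denominator = sum(r["x_sq"] for r in rows)
    return numerator / denominator


def run_pipeline(min_value):
    """Call each stage once, passing outputs forward explicitly."""
    rows = fetch()
    rows = clean(rows, min_value)
    rows = add_features(rows)
    return train(rows)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Toy pipeline sketch")
    parser.add_argument("--min-value", type=float, default=0.0)
    args = parser.parse_args(argv)
    print(f"slope: {run_pipeline(args.min_value):.3f}")


if __name__ == "__main__":
    main()
```

With only one piece of state (`rows`) this stays tidy; the argument-passing problem described in (2) shows up once several stages each need several earlier outputs.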

I suppose this overall situation is very common among data scientists, but surprisingly I cannot find any useful advice on it.

Folks, please, share your ideas and experience. Have you ever encountered this issue? How have you tackled it?

307

asked Aug 24 '15 13:08

kurtosis

2 Answers

Life saver: as you're writing your notebooks, incrementally refactor your code into functions, writing some minimal assert tests and docstrings.

After that, refactoring from notebook to script is natural. Not only that, but it makes your life easier when writing long notebooks, even if you have no plans to turn them into anything else.

Basic example of a cell's content with "minimal" tests and docstrings:

    import os
    import zipfile


    def zip_count(f):
        """Given zip filename, returns number of files inside.

        str -> int"""
        # ZipFile is a context manager, so the archive is closed on exit
        with zipfile.ZipFile(f) as archive:
            num_files = len(archive.infolist())
        return num_files


    zip_filename = 'data/myfile.zip'

    # Make sure `myfile` always has three files
    assert zip_count(zip_filename) == 3
    # And total zip size is under 2 MB
    assert os.path.getsize(zip_filename) / 1024**2 < 2

    print(zip_count(zip_filename))

Once you've exported it to bare .py files, your code will probably not be structured into classes yet. But it is worth the effort to have refactored your notebook to the point where it has a set of documented functions, each with a set of simple assert statements that can easily be moved into tests.py for testing with pytest, unittest, or what have you. If it makes sense, bundling these functions into methods for your classes is dead-easy after that.
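As a rough sketch of what that move could look like, the `assert` lines from the cell above can become small test functions. The version below builds a throwaway archive in a temporary directory instead of relying on `data/myfile.zip`; that fixture (`make_sample_zip`) and the test names are assumptions made for illustration. In a real project `zip_count` would be imported from your exported module rather than redefined here.

```python
import os
import tempfile
import zipfile


def zip_count(f):
    """Given zip filename, returns number of files inside."""
    with zipfile.ZipFile(f) as archive:
        return len(archive.infolist())


def make_sample_zip(directory, n_files=3):
    """Create a small archive with n_files members (hypothetical fixture)."""
    path = os.path.join(directory, "myfile.zip")
    with zipfile.ZipFile(path, "w") as archive:
        for i in range(n_files):
            archive.writestr(f"file_{i}.txt", "hello")
    return path


def test_zip_count():
    with tempfile.TemporaryDirectory() as tmp:
        assert zip_count(make_sample_zip(tmp)) == 3


def test_zip_size_under_2_mb():
    with tempfile.TemporaryDirectory() as tmp:
        path = make_sample_zip(tmp)
        assert os.path.getsize(path) / 1024**2 < 2
```

Saved as `tests.py`, this should be runnable with `pytest tests.py`, which picks up the `test_*` functions.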

If all goes well, all you need to do after that is write your if __name__ == '__main__': block and its "hooks": if you're writing a script to be called from the terminal, you'll want to handle command-line arguments; if you're writing a module, you'll want to think about its API with the __init__.py file, etc.
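For the terminal case, a minimal sketch might look like the following. The argument names (`paths`, `--max-mb`) and the return-code convention are assumptions for illustration, not part of the original notebook; `zip_count` is repeated so the block is self-contained.

```python
import argparse
import os
import zipfile


def zip_count(f):
    """Given zip filename, returns number of files inside."""
    with zipfile.ZipFile(f) as archive:
        return len(archive.infolist())


def main(argv=None):
    """Parse command-line arguments and report on each archive."""
    parser = argparse.ArgumentParser(description="Count files in zip archives.")
    parser.add_argument("paths", nargs="*", help="zip files to inspect")
    parser.add_argument("--max-mb", type=float, default=2.0,
                        help="warn when an archive is larger than this")
    # argv=None makes argparse fall back to sys.argv[1:]
    args = parser.parse_args(argv)

    exit_code = 0
    for path in args.paths:
        size_mb = os.path.getsize(path) / 1024**2
        print(f"{path}: {zip_count(path)} files, {size_mb:.2f} MB")
        if size_mb > args.max_mb:
            print(f"warning: {path} exceeds {args.max_mb} MB")
            exit_code = 1
    return exit_code


if __name__ == "__main__":
    main()
```

Letting `main` accept an explicit `argv` list keeps it callable from tests or from another notebook without touching `sys.argv`.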

It all depends on what the intended use case is, of course: there's quite a difference between converting a notebook to a small script vs. turning it into a full-fledged module or package.

Here are a few ideas for a notebook-to-script workflow:

1. Export the Jupyter Notebook to a Python file (.py) through the GUI.
2. Remove the "helper" lines that don't do the actual work: print statements, plots, etc.
3. If need be, bundle your logic into classes. The only extra refactoring work required should be to write your class docstrings and attributes.
4. Write your script's entry point with if __name__ == '__main__'.
5. Separate your assert statements for each of your functions/methods, and flesh out a minimal test suite in tests.py.
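Steps 1 and 2 can also be scripted. The usual routes are the GUI or `jupyter nbconvert --to script`; the sketch below is a simplified, stdlib-only stand-in that relies on an `.ipynb` file being JSON with a `cells` list. The function name `notebook_to_script` and the rule for dropping `print` lines and magics are assumptions for illustration, not what nbconvert does.

```python
import json


def notebook_to_script(nb_json, drop_prefixes=("print(", "%", "!")):
    """Concatenate the code cells of a notebook given as a JSON string.

    Lines starting with any of drop_prefixes are discarded (a crude
    version of removing "helper" lines).
    """
    notebook = json.loads(nb_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", [])
        # The notebook format allows `source` to be a string or a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        kept = [line for line in text.splitlines()
                if not line.lstrip().startswith(drop_prefixes)]
        if kept:
            chunks.append("\n".join(kept))
    return "\n\n".join(chunks) + "\n"


# Tiny made-up notebook to show the call
demo = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["x = 1\n", "print(x)\n"]},
]})
print(notebook_to_script(demo))
```

The line filter is deliberately naive: removing a `print` that is the only statement inside an `if` or a loop leaves invalid code, so the result still needs a manual pass.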

answered Oct 13 '22 01:10

François Leblanc

We are having a similar issue. However, we use several notebooks for prototyping, and their outcomes should eventually become several Python scripts as well.

Our approach is to set aside the code that seems to repeat across those notebooks. We put it into a Python module, which is imported by each notebook and also used in production. We improve this module iteratively and add tests for what we find during prototyping.

The notebooks then become more like configuration scripts (which we simply copy into the resulting Python files) plus several prototyping checks and validations, which we do not need in production.
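A minimal sketch of that split, with every name hypothetical: `clean` and `summarize` stand for functions of the shared module (in practice they would live in their own file and be imported by each notebook), and `CONFIG` plus `run` stand for the notebook-turned-configuration part.

```python
# --- shared module: reusable, tested code (would be a separate file) ---

def clean(values, lower, upper):
    """Keep only values inside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


def summarize(values):
    """Return count and mean of a non-empty list."""
    return {"count": len(values), "mean": sum(values) / len(values)}


# --- notebook / production script: mostly configuration plus calls ---

CONFIG = {"lower": 0, "upper": 100}


def run(values, config=CONFIG):
    """Apply the shared functions with this script's configuration."""
    return summarize(clean(values, config["lower"], config["upper"]))
```

The prototyping checks stay in the notebook; only the configuration and the calls into the shared module carry over to the production script.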

Most of all, we are not afraid of refactoring :)

answered Oct 13 '22 00:10

Radek
