
Scientific Computing & Ipython Notebook: How to organize code?

Tags:

python

jupyter-notebook

I'm using IPython Notebook for my research. As my file grows bigger, I constantly extract code out of it: plotting methods, fitting methods, etc.

I think I need a way to organize this. Is there a good way to do it?

Currently, I do this by:

data/
helpers/
my_notebook.ipynb
import_file.py

I store data in data/, extract helper methods into helpers/, and divide them into files like plot_helper.py, app_helper.py, etc.

I summarize the imports in import_file.py:

from IPython.display import display

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
from matplotlib import pyplot as plt
import sklearn
import re

And then I can import everything I need in the top cell of the .ipynb (the top cell is shown in the update below).

The structure can be seen at https://github.com/cqcn1991/Wind-Speed-Analysis

One problem I have right now is that I have too many submodules in helpers/, and it's hard to decide which method should go into which file.

I think one possible way is to organize them into pre-processing, processing, and post-processing.

UPDATE:

My big jupyter research notebook: https://cdn.rawgit.com/cqcn1991/Wind-Speed-Analysis/master/output_HTML/marham.html

The top cell is standard import + magic + extensions

%matplotlib inline
%load_ext autoreload
%autoreload 2

from __future__ import division
from import_file import *
load_libs()

252

asked Apr 05 '16 13:04

cqcn1991

2 Answers

There are many ways to organise an IPython research project. I manage a team of 5 data scientists and 3 data engineers, and I have found these tips to work well for our use case:

This is a summary of my PyData London talk:

http://www.slideshare.net/vladimirkazantsev/clean-code-in-jupyter-notebook

1. Create a shared (multi-project) utils library

You will most likely have to reuse or repeat some code across different research projects. Start refactoring those pieces into a "common utils" package. Write a setup.py file and push the package to GitHub (or similar), so that team members can "pip install" it from VCS.

Examples of functionality to put in there are:

Data Warehouse or Storage access functions
common plotting functions
re-usable math/stats methods

2. Split your fat master notebook into smaller notebooks

In my experience, a good length for a file of code (in any language) is only a few screens (100-400 lines). A Jupyter Notebook is still a source file, but with output! Reading a notebook with 20+ cells is very hard. I like my notebooks to have 4-10 cells at most.

Ideally, each notebook should have one "hypothesis-data-conclusions" triplet.

Example of splitting the notebook:

1_data_preparation.ipynb

2_data_validation.ipynb

3_exploratory_plotting.ipynb

4_simple_linear_model.ipynb

5_hierarchical_model.ipynb

playground.ipynb

Save the output of 1_data_preparation.ipynb to a pickle with df.to_pickle('clean_data.pkl'), to CSV, or to a fast DB, and load it with pd.read_pickle("clean_data.pkl") at the top of each later notebook.

3. It is not Python - it is IPython Notebook

What makes a notebook unique is its cells. Use them well. Each cell should be an "idea-execution-output" triplet. If a cell does not output anything, combine it with the following cell. An import cell should output nothing; that is the expected output for it.

If a cell has several outputs, it may be worth splitting it.

Hiding imports may or may not be a good idea:

from myimports import *

Your reader may want to figure out exactly what you are importing so she can use the same things in her own research. So use this with caution. We do use it for pandas, numpy, matplotlib, and sql, however.

Hiding "secret sauce" in /helpers/model.py is bad:

myutil.fit_model_and_calculate(df)

This may save you typing and remove duplicate code, but your collaborator will have to open another file to figure out what is going on. Unfortunately, the notebook (Jupyter) is a fairly inflexible and basic environment, but you still don't want to force your reader to leave it for every piece of code. I hope IDEs will improve in the future, but for now, keep the "secret sauce" inside the notebook and put "boring and obvious utils" wherever you see fit. DRY still applies; you have to find the balance.

This should not stop you from packaging re-usable code into functions or even small classes. But "flat is better than nested".

4. Keep notebooks clean

You should be able to "Restart & Run All" at any point in time.

Each re-run should be fast, which means you may have to invest in writing some caching functions. You may even want to put those into your "common utils" module.
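As one possible shape for such a caching function, here is a minimal sketch that stores a function's result on disk with pickle and reuses it on later runs. The names disk_cache, CACHE_DIR and slow_square are made up for illustration, and the sketch assumes the arguments and the result are picklable.

```python
import functools
import hashlib
import os
import pickle
import tempfile

# Hypothetical cache location; a project would likely use its own folder.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "notebook_cache")


def disk_cache(func):
    """Cache a function's return value on disk, keyed by its arguments."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        os.makedirs(CACHE_DIR, exist_ok=True)
        # Build the file name from the function name and a hash of the arguments.
        key = pickle.dumps((args, sorted(kwargs.items())))
        digest = hashlib.sha1(key).hexdigest()
        path = os.path.join(CACHE_DIR, "{}_{}.pkl".format(func.__name__, digest))
        if os.path.exists(path):
            # Cache hit: load the stored result instead of recomputing it.
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(*args, **kwargs)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result

    return wrapper


@disk_cache
def slow_square(x):
    # Stand-in for an expensive query or model fit.
    return x * x
```

Note that this sketch never invalidates the cache: if the body of the decorated function changes, the old files have to be deleted by hand.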

Each cell should be executable multiple times, without the need to re-initialise the notebook. This saves you time and keeps the code more robust. But a cell may depend on state created by previous cells. Making each cell completely independent from the cells above is an anti-pattern, IMO.

After you are done with the research, you are not done with the notebook. Refactor.
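A small illustration of a re-runnable cell, using made-up wind speed values: the cell reads from the name created upstream and writes to a new name, so running it twice gives the same result. The variable names are assumptions for the example only.

```python
# Produced by an earlier cell (made-up values, in m/s).
raw_speeds = [3.0, 4.5, 6.0]

# Fragile version: every re-run rescales the already scaled list.
#     raw_speeds = [v * 3.6 for v in raw_speeds]

# Re-runnable version: read from the upstream name, write to a new name.
speeds_kmh = [v * 3.6 for v in raw_speeds]

# Running the same statement again changes nothing.
speeds_kmh = [v * 3.6 for v in raw_speeds]
```

The cell still depends on raw_speeds from the cell above, which is fine; it just does not overwrite it.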

5. Create a project module, but be very selective

If you keep re-using a plotting or analytics function, do refactor it into this module. But in my experience, people expect to read and understand a notebook without opening multiple util sub-modules. So naming your sub-routines well is even more important here than in normal Python.

"Clean code reads like well-written prose", Grady Booch (developer of UML)

6. Host Jupyter server in the cloud for the entire team

You will have one environment, so everyone can quickly review and validate research without the need to match the environment (even though conda makes this pretty easy).

You can also configure defaults, such as the matplotlib style and colors, and make matplotlib inline by default:

In ~/.ipython/profile_default/ipython_config.py

add the line c.InteractiveShellApp.matplotlib = 'inline'

7. (experimental idea) Run a notebook from another notebook, with different parameters

Quite often you may want to re-run the whole notebook, but with different input parameters.

To do this, you can structure your research notebook as follows: place a params dictionary in the first cell of the "source notebook".

params = dict(platform='iOS',
              start_date='2016-05-01',
              retention=7)
df = get_data(params ..)
do_analysis(params ..)

And in another (higher logical level) notebook, execute it using this function:

import io

import nbformat
from IPython import get_ipython


def run_notebook(nbfile, **kwargs):
    """
    example:
    run_notebook('report.ipynb', platform='google_play', start_date='2016-06-10')
    """

    def read_notebook(nbfile):
        if not nbfile.endswith('.ipynb'):
            nbfile += '.ipynb'

        with io.open(nbfile, encoding='utf-8') as f:
            nb = nbformat.read(f, as_version=4)
        return nb

    ip = get_ipython()
    gl = ip.ns_table['user_global']
    gl['params'] = None
    arguments_in_original_state = True

    for cell in read_notebook(nbfile).cells:
        if cell.cell_type != 'code':
            continue
        ip.run_cell(cell.source)

        # Override the defaults once, right after the cell that defines params.
        if arguments_in_original_state and isinstance(gl['params'], dict):
            gl['params'].update(kwargs)
            arguments_in_original_state = False

Whether this "design pattern" proves to be useful is yet to be seen. We had some success with it: at least we stopped duplicating notebooks only to change a few inputs.

Refactoring the notebook into a class or module breaks the quick feedback loop of "idea-execute-output" that cells provide. And, IMHO, it is not "ipythonic".
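The override step in run_notebook can be tried without IPython or nbformat. The sketch below is a stand-in, not the function above: it executes plain source strings with exec in a private namespace and updates params once, right after the first "cell" that defines it. The name run_cells and the sample cells are hypothetical.

```python
def run_cells(cell_sources, **overrides):
    """Execute code strings in order, overriding `params` once it is defined."""
    namespace = {"params": None}
    overridden = False
    for source in cell_sources:
        exec(source, namespace)
        # Apply the overrides right after the first cell that defines params.
        if not overridden and isinstance(namespace.get("params"), dict):
            namespace["params"].update(overrides)
            overridden = True
    return namespace


# Two stand-in "cells": the first only defines defaults, the second uses them.
cells = [
    "params = dict(platform='iOS', retention=7)",
    "summary = '{platform}/{retention}'.format(**params)",
]
result = run_cells(cells, platform="google_play")
```

This also shows why the params dictionary belongs in a cell of its own: anything in the same cell that already consumes params would run with the defaults, before the overrides are applied.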

8. Write (unit) tests for shared library in notebooks and run with py.test

There is a plugin for py.test that can discover and run tests inside notebooks!

https://pypi.python.org/pypi/pytest-ipynb
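As a rough idea of what such a test could look like, here is a sketch of a shared-library helper and a test for it written with plain asserts, which is the style pytest collects from functions named test_*. The helper zscore is hypothetical, and how pytest-ipynb maps notebook cells to tests is not shown here; check the plugin's own documentation for that.

```python
import math


def zscore(values):
    """Hypothetical shared-library helper: standardise a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


def test_zscore_has_zero_mean_and_unit_variance():
    scores = zscore([2.0, 4.0, 6.0, 8.0])
    # Standardised values should be centred on zero with variance one.
    assert abs(sum(scores)) < 1e-9
    assert abs(sum(s * s for s in scores) / len(scores) - 1.0) < 1e-9
```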

117

answered Sep 17 '22 17:09

volodymyr

While the given answers cover the topic thoroughly, it is still worth mentioning Cookiecutter, which provides a data science boilerplate project structure:

Cookiecutter Data Science

provides a data science template for projects in Python with a logical, reasonably standardized, yet flexible project structure for doing and sharing data science work.

Your analysis doesn't have to be in Python, but the template does provide some Python boilerplate (in the src folder for example, and the Sphinx documentation skeleton in docs). However, nothing is binding.

The following quote from the project description sums it up pretty nicely:

Nobody sits around before creating a new Rails project to figure out where they want to put their views; they just run rails new to get a standard project skeleton like everybody else.

Requirements:

Python 2.7 or 3.5
cookiecutter Python package >= 1.4.0: pip install cookiecutter

Getting started

Starting a new project is as easy as running this command at the command line. No need to create a directory first, the cookiecutter will do it for you.

cookiecutter https://github.com/drivendata/cookiecutter-data-science

Directory structure

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results-oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org

ProjectTemplate - provides a similar system for R data analysis.

answered Sep 19 '22 17:09

wp78de
