
Python Machine Learning/Data Science Project Structure

I'm looking for information on how a Python machine learning project should be organized. For ordinary Python projects there is Cookiecutter, and for R there is ProjectTemplate.

This is my current folder structure, but I'm mixing Jupyter Notebooks with actual Python code and it does not seem very clear.

.
├── cache
├── data
├── my_module
├── logs
├── notebooks
├── scripts
├── snippets
└── tools

I work in the scripts folder and currently add all the functions to files under my_module, but that leads to errors when loading data (relative vs. absolute paths) and other problems.

I could not find proper best practices or good examples on this topic besides this Kaggle competition solution and some notebooks that have all their functions condensed at the top.

asked Jun 05 '26 01:06 by David Gasquez


2 Answers

We've started a cookiecutter-data-science project designed for Python data scientists that might be of interest to you; check it out here. The structure is explained here.

Would love feedback if you have it! Feel free to respond here, open PRs or file issues.


In response to your issue about re-using code by importing .py files into notebooks, the most effective way that our team has found is to append to the system path. This may make some people cringe, but it seems like the cleanest way of importing code into a notebook without lots of module boilerplate and a pip install -e.

One tip is to use the %autoreload and %aimport magics with the above. Here's an example:

# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# import my method from the source code
%aimport preprocess.build_features

The above code comes from section 3.5 in this notebook for some context.
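
The snippet above assumes the notebook's working directory sits exactly one level below the project root. As a rough sketch of a less fragile variant (not part of cookiecutter-data-science; the helper name add_src_to_path and its src_name parameter are made up for illustration), you could walk up the directory tree until a 'src' folder turns up:

```python
import sys
from pathlib import Path


def add_src_to_path(start=None, src_name="src"):
    """Search `start` and its parents for a `src_name` directory.

    Appends the first match to sys.path (only once) and returns it.
    Hypothetical helper: the name and parameters are illustrative only.
    """
    current = Path(start or Path.cwd()).resolve()
    # Check the starting directory first, then each parent up to the root.
    for candidate in (current, *current.parents):
        src_dir = candidate / src_name
        if src_dir.is_dir():
            src_path = str(src_dir)
            if src_path not in sys.path:
                sys.path.append(src_path)
            return src_dir
    raise FileNotFoundError(f"no '{src_name}' directory found above {current}")
```

In a notebook you would call add_src_to_path() once and then use %aimport as before. Since the returned path's parent is the project root, something like add_src_to_path().parent / "data" can also serve as an anchor for data files, which may help with the relative/absolute path errors mentioned in the question.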

answered Jun 06 '26 14:06 by hume


You may want to look at:

http://tshauck.github.io/Gloo/

Gloo's goal is to tie together a lot of the data analysis actions that happen regularly and make those processes easy: automatically loading data into the IPython environment, running scripts, making utility functions available and more. These are things that have to be done often, but aren't the fun part.

It's not actively maintained but the basics are there.

answered Jun 06 '26 14:06 by DaCoEx


