I have spent the last few days learning how to structure a data science project to keep it simple, reusable and pythonic. Sticking to these guidelines, I created my_project. You can see its structure below.
├── README.md
├── data
│   ├── processed      <-- data files
│   └── raw
├── notebooks
│   └── notebook_1
├── setup.py
├── settings.py        <-- settings file
└── src
    ├── __init__.py
    └── data
        └── get_data.py  <-- script
I defined a function that loads data from ./data/processed. I want to use this function in other scripts and also in Jupyter notebooks located in ./notebooks.
import random

import pandas as pd

def data_sample(code=None):
    df = pd.read_parquet('../../data/processed/my_data')
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df
Obviously this function won't work anywhere unless I run it directly in the script where it is defined, because the relative path is resolved against the current working directory.
My idea was to create settings.py, where I'd declare:
from os.path import join, dirname
DATA_DIR = join(dirname(__file__), 'data', 'processed')
So now I can write:
import os
import random

import pandas as pd

from my_project import settings

def data_sample(code=None):
    file_path = os.path.join(settings.DATA_DIR, 'my_data')
    df = pd.read_parquet(file_path)
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df
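Note that for `from my_project import settings` to resolve from notebooks and other scripts, the package has to be importable, e.g. after an editable install (`pip install -e .`). A minimal setup.py sketch for that — the name and version here are placeholders, not something I've settled on:

```python
# setup.py - minimal sketch; name and version are placeholders
from setuptools import setup, find_packages

setup(
    name='my_project',
    version='0.1.0',
    packages=find_packages(),
)
```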
Questions:
Is it common practice to refer to files this way? settings.DATA_DIR looks kinda ugly.
Is this at all how settings.py should be used? And should it be placed in this directory? I have seen it in a different spot, in this repo under .samr/settings.py.
I understand that there might not be 'one right answer'; I'm just trying to find a logical, elegant way of handling these things.
A data science project can be divided into four major components: data, figures, code, and products. Make a folder bearing the name of each component, and consider prefixing file names with numbers to make them sortable. Naming directories and files on your computer should be a well-thought-out process.
Folder structure of a data science project:
project_name: Name of the project.
src: The folder that contains the source code related to data gathering, data preparation, feature extraction, etc.
tests: The folder that contains the unit tests for the code maintained within the src folder.
I'm maintaining an economics data project based on the DataDriven Cookiecutter, which I feel is a great template.
Separating your data folders from your code seems like an advantage to me: it allows you to treat your work as a directed flow of transformations (a 'DAG'), starting with immutable initial data and moving through interim to final results.
Initially, I reviewed pkg_resources, but declined to use it (long syntax, and I fell short of understanding how to create a package) in favour of my own helper functions/classes that navigate the directory tree.
Essentially, the helpers do two things:
1. Persist the project root folder and some other paths in constants:
from pathlib import Path

# shorter version
ROOT = Path(__file__).parents[3]

# longer version
def find_repo_root():
    """Return the root folder for the repository.

    The current file is assumed to be at:
    <repo_root>/src/kep/helper/<this file>.py
    """
    levels_up = 3
    return Path(__file__).parents[levels_up]

ROOT = find_repo_root()
DATA_FOLDER = ROOT / 'data'
UNPACK_RAR_EXE = str(ROOT / 'bin' / 'UnRAR.exe')
XL_PATH = str(ROOT / 'output' / 'kep.xlsx')
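To make the parents indexing concrete, here is a tiny self-contained check (the path itself is made up for illustration):

```python
from pathlib import Path

# a hypothetical helper-file location, matching the docstring above
p = Path('/repo/src/kep/helper/locations.py')

print(p.parents[0])  # /repo/src/kep/helper  (the containing folder)
print(p.parents[3])  # /repo                 (three levels further up)
```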
This is similar to what you do with DATA_DIR. A possible weak point is that I manually hardcode the relative location of the helper file with respect to the project root; if the helper file is moved, this needs to be adjusted. But hey, this is the same way it is done in Django.
2. Allow access to specific data in the raw, interim and processed folders.
This can be a simple function returning a full path for a filename in a folder, for example:
def interim(filename):
    """Return path for *filename* in the 'data/interim' folder."""
    return str(ROOT / 'data' / 'interim' / filename)
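The same idea extends to all three data folders with one parametrised helper — a sketch, with the folder layout assumed from the tree above and the root location an assumption about where this file sits:

```python
from pathlib import Path

# assumption: this file lives one level below the repo root
ROOT = Path(__file__).resolve().parents[1]

def datapath(subfolder, filename):
    """Full path for *filename* in data/<subfolder> ('raw', 'interim' or 'processed')."""
    return str(ROOT / 'data' / subfolder / filename)
```

Usage is then `datapath('processed', 'my_data')` instead of three near-identical functions.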
In my project I have year-month subfolders inside the interim and processed directories, and I address data by year, month and sometimes frequency. For this data structure I have InterimCSV and ProcessedCSV classes that resolve the specific paths, like:
from .helper import ProcessedCSV, InterimCSV

# somewhere in code
csv_text = InterimCSV(self.year, self.month).text()

# later in code
path = ProcessedCSV(2018, 4).path(freq='q')
The code for the helper is here. Additionally, the classes create subfolders if they are not present (I want this for unit tests in a temporary directory), and there are methods for checking that files exist and for reading their contents.
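For a rough idea of the shape of such a class, here is a simplified sketch — not the real implementation; the root location and the file name `tab.csv` are assumptions for illustration:

```python
from pathlib import Path

ROOT = Path('.')  # in the real helper this is the repo root

class InterimCSV:
    """Path to an interim CSV addressed by year and month."""
    subfolder = 'interim'

    def __init__(self, year, month):
        self.folder = ROOT / 'data' / self.subfolder / str(year) / f'{month:02d}'
        # create subfolders if they are not present
        self.folder.mkdir(parents=True, exist_ok=True)
        self.path = self.folder / 'tab.csv'

    def exists(self):
        return self.path.exists()

    def text(self):
        return self.path.read_text(encoding='utf-8')
```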
In your example, you can easily fix the root directory in settings.py, but I think you can go a step further and abstract your data access. Currently data_sample() mixes file access and data transformation, which is not a great sign, and it also uses a global name, another bad sign for a function. I suggest you consider the following:
# keep this in settings.py
def processed(filename):
    return os.path.join(DATA_DIR, filename)

# this works on a dataframe - your argument is a dataframe,
# and you return a dataframe
def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    # FIXME: what is `code`?
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')

# make a small but elegant pipeline of data transformation
file_path = processed('my_data')
df0 = pd.read_parquet(file_path)
df = transform_sample(df0)
This works fine as long as you are not committing lots of data, and as long as you make clear the difference between snapshots of the uncontrolled outside world and your own derived data (code + raw == state). It is sometimes useful to keep raw append-only-ish and to think about symlinking steps like raw/interesting_source/2018.csv.gz -> raw_appendonly/interesting_source/2018.csv.gz.20180401T12:34:01 or some similar pattern to establish a "use latest" input structure. Try to clearly separate config settings (my_project/__init__.py, config.py, settings.py or whatever) that might need to change depending on the environment (imagine swapping out the filesystem for a blobstore or whatever). setup.py usually sits at the top level, my_project/setup.py, and anything related to runnable stuff (not docs; examples, not sure) goes in my_project/my_project. Define one _mydir = os.path.dirname(os.path.realpath(__file__)) in one place (config.py) and rely on that to avoid refactoring pain.