
Elegant way to refer to files in data science project

I have spent the past few days learning how to structure a data science project to keep it simple, reusable and Pythonic. Sticking to this guideline, I have created my_project. You can see its structure below.

├── README.md
├── data
│   ├── processed          <-- data files
│   └── raw
├── notebooks
│   └── notebook_1
├── setup.py
│
├── settings.py            <-- settings file
└── src
    ├── __init__.py
    │
    └── data
        └── get_data.py    <-- script

I defined a function that loads data from data/processed. I want to use this function in other scripts and also in Jupyter notebooks located in notebooks.

import random

import pandas as pd

def data_sample(code=None):
    df = pd.read_parquet('../../data/processed/my_data')
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df

Obviously this function won't work anywhere unless I run it directly from the script where it is defined. My idea was to create a settings.py where I'd declare:

from os.path import join, dirname

DATA_DIR = join(dirname(__file__), 'data', 'processed')

So now I can write:

import os
import random

import pandas as pd

from my_project import settings

def data_sample(code=None):
    file_path = os.path.join(settings.DATA_DIR, 'my_data')
    df = pd.read_parquet(file_path)
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df

Questions:

  1. Is it common practice to refer to files in this way? settings.DATA_DIR looks kinda ugly.

  2. Is this at all how settings.py should be used? And should it be placed in this directory? I have seen it in a different spot in this repo, under .samr/settings.py

I understand that there might not be 'one right answer'; I am just trying to find a logical, elegant way of handling these things.

asked Jun 21 '18 by dylan_fan



2 Answers

I'm maintaining an economics data project based on the DataDriven Cookiecutter, which I feel is a great template.

Separating your data folders and code seems an advantage to me, allowing you to treat your work as a directed flow of transformations (a 'DAG'), starting with immutable initial data and moving to interim and final results.

Initially, I reviewed pkg_resources, but declined to use it (long syntax, and I fell short of understanding how to create a package) in favour of my own helper functions/classes that navigate the directory tree.

Essentially, the helpers do two things:

1. Persist the project root folder and some other paths in constants:

from pathlib import Path

# shorter version
ROOT = Path(__file__).parents[3]

# longer version
def find_repo_root():
    """Return the root folder for the repository.
    The current file is assumed to be:
        <repo_root>/src/kep/helper/<this file>.py
    """
    levels_up = 3
    return Path(__file__).parents[levels_up]

ROOT = find_repo_root()
DATA_FOLDER = ROOT / 'data'
UNPACK_RAR_EXE = str(ROOT / 'bin' / 'UnRAR.exe')
XL_PATH = str(ROOT / 'output' / 'kep.xlsx')

This is similar to what you do with DATA_DIR. A possible weak point is that here I manually hardcode the relative location of the helper file with respect to the project root. If the helper file is moved, this needs to be adjusted. But hey, this is the same way it is done in Django.
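For comparison, this is roughly what a recent Django settings.py generated by startproject does to anchor all paths at the project root:

from pathlib import Path

# the project root is two levels above the settings file
BASE_DIR = Path(__file__).resolve().parent.parent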

2. Allow access to specific data in raw, interim and processed folders.

This can be a simple function returning a full path by a filename in a folder, for example:

def interim(filename):
    """Return path for *filename* in the 'data/interim' folder."""
    return str(ROOT / 'data' / 'interim' / filename)

In my project I have year-month subfolders for the interim and processed directories, and I address data by year, month and sometimes frequency. For this data structure I have InterimCSV and ProcessedCSV classes that give reference to specific paths, like:

from .helper import ProcessedCSV, InterimCSV

# somewhere in code
csv_text = InterimCSV(self.year, self.month).text()

# later in code
path = ProcessedCSV(2018, 4).path(freq='q')

The code for the helper is here. Additionally, the classes create subfolders if they are not present (I want this for unit tests in a temp directory), and there are methods for checking that files exist and for reading their contents.
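For illustration, a minimal sketch of what such helper classes might look like; the folder layout, file names and ROOT location here are assumptions based on the description above, not the actual linked code:

from pathlib import Path

# assumed: this helper file sits one level below the repo root
ROOT = Path(__file__).parents[1]

class InterimCSV:
    """Hypothetical helper addressing a CSV in data/interim/<year>/<month>/."""
    subfolder = 'interim'

    def __init__(self, year, month):
        self.year = year
        self.month = month

    def folder(self):
        # create subfolders if missing (handy for unit tests in a temp directory)
        path = ROOT / 'data' / self.subfolder / str(self.year) / f'{self.month:02d}'
        path.mkdir(parents=True, exist_ok=True)
        return path

    def path(self):
        return self.folder() / 'tab.csv'

    def exists(self):
        return self.path().exists()

    def text(self):
        return self.path().read_text(encoding='utf-8')

class ProcessedCSV(InterimCSV):
    subfolder = 'processed'

    def path(self, freq=None):
        # assumed naming convention: df.csv, or dfq.csv for quarterly data
        name = f'df{freq}.csv' if freq else 'df.csv'
        return self.folder() / name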

In your example, you can easily have the root directory fixed in settings.py, but I think you can go a step further by abstracting your data access.

Currently data_sample() mixes file access and data transformation, which is not a great sign, and it also uses a global name, another bad sign for a function. I suggest you consider the following:

import os
import random

import pandas as pd

# keep this in settings.py, next to DATA_DIR
def processed(filename):
    return os.path.join(DATA_DIR, filename)

# this works on a dataframe - your argument is a dataframe,
# and you return a dataframe
def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    # FIXME: what is `code`?
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')

# make a small but elegant pipeline of data transformation
file_path = processed('my_data')
df0 = pd.read_parquet(file_path)
df = transform_sample(df0)
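One payoff of this split (my own note, not part of the answer): transform_sample() can now be exercised on an in-memory frame, with no files involved:

import pandas as pd

df = pd.DataFrame({'code': ['a', 'a', 'b'],
                   'Date': pd.to_datetime(['2018-02-01', '2018-01-01', '2018-01-01'])})
sample = transform_sample(df, code='a')
assert list(sample.code.unique()) == ['a']
assert sample['Date'].is_monotonic_increasing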
answered Oct 09 '22 by Evgeny


This looks fine, as long as you are not committing lots of data and you make clear the difference between snapshots of the uncontrolled outside world and your own derived data, i.e. (code + raw) == state. It is sometimes useful to keep raw append-only-ish and to think about symlinking steps like raw/interesting_source/2018.csv.gz -> raw_appendonly/interesting_source/2018.csv.gz.20180401T12:34:01, or some similar pattern, to establish a "use latest" input structure.

Try to clearly separate config settings (my_project/__init__.py, config.py, settings.py or whatever) that might need to be changed depending on the environment (imagine swapping out the filesystem for a blob store or whatever). setup.py usually sits at the top level, my_project/setup.py, and anything related to runnable stuff (not docs; examples I'm not sure about) goes in my_project/my_project.

Define one _mydir = os.path.dirname(os.path.realpath(__file__)) in one place (config.py) and rely on that to avoid refactoring pain.
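A minimal sketch of that single-anchor pattern, assuming a config.py inside the package (the directory names here are illustrative):

# my_project/config.py
import os

# the single place that knows where this file lives
_mydir = os.path.dirname(os.path.realpath(__file__))

# everything else is derived from it; adjust if your layout differs
ROOT = os.path.dirname(_mydir)          # assumed: <root>/my_project/config.py
RAW_DIR = os.path.join(ROOT, 'data', 'raw')
PROCESSED_DIR = os.path.join(ROOT, 'data', 'processed')

Other modules then do from my_project.config import PROCESSED_DIR instead of computing paths relative to themselves.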

answered Oct 09 '22 by mathtick