I have spent the last few days learning how to structure a data science project to keep it simple, reusable and pythonic. Sticking to these guidelines, I created my_project. You can see its structure below.
├── README.md
├── data
│   ├── processed      <-- data files
│   └── raw
├── notebooks
│   └── notebook_1
├── setup.py
├── settings.py        <-- settings file
└── src
    ├── __init__.py
    └── data
        └── get_data.py  <-- script
I defined a function that loads data from ./data/processed. I want to use this function in other scripts and also in Jupyter notebooks located in ./notebooks.
import random

import pandas as pd

def data_sample(code=None):
    df = pd.read_parquet('../../data/processed/my_data')
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df
Obviously this function won't work anywhere unless I run it directly in the script where it is defined, because the relative path is resolved against the current working directory.
My idea was to create settings.py, where I'd declare:
from os.path import join, dirname
DATA_DIR = join(dirname(__file__), 'data', 'processed')
So now I can write:
import os
import random

import pandas as pd

from my_project import settings

def data_sample(code=None):
    file_path = os.path.join(settings.DATA_DIR, 'my_data')
    df = pd.read_parquet(file_path)
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df
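Note that for `from my_project import settings` to resolve from notebooks and other scripts, the package has to be importable, e.g. after an editable install (`pip install -e .`). A minimal setup.py sketch for that — the name and version here are placeholders, not something I've settled on:

```python
# setup.py - minimal sketch; name and version are placeholders
from setuptools import setup, find_packages

setup(
    name='my_project',
    version='0.1.0',
    packages=find_packages(),
)
```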
Questions:
Is it common practice to refer to files this way? settings.DATA_DIR looks kinda ugly.
Is this at all how settings.py should be used? And should it be placed in this directory? I have seen it in a different spot, in this repo under .samr/settings.py.
I understand that there might not be 'one right answer'; I'm just trying to find a logical, elegant way of handling these things.
A data science project can be divided into four major components: data, figures, code, and products. Make a folder bearing the name of each component, and consider prefixing file names with numbers to make them sortable. Naming directories and files on your computer should be a well-thought-out process.
Folder structure of a data science project:
project_name: Name of the project.
src: The folder that contains the source code related to data gathering, data preparation, feature extraction, etc.
tests: The folder that contains the unit tests for the code maintained within the src folder.
I'm maintaining an economics data project based on the DataDriven Cookiecutter, which I feel is a great template.
Separating your data folders from your code seems like an advantage to me: it allows you to treat your work as a directed flow of transformations (a 'DAG'), starting with immutable initial data and moving through interim to final results.
Initially, I reviewed pkg_resources, but declined to use it (long syntax, and I fell short of understanding how to create a package) in favour of my own helper functions/classes that navigate the directory tree.
Essentially, the helpers do two things:
1. Persist the project root folder and some other paths in constants:
from pathlib import Path

# shorter version
ROOT = Path(__file__).parents[3]

# longer version
def find_repo_root():
    """Return the root folder for the repository.

    The current file is assumed to be at:
    <repo_root>/src/kep/helper/<this file>.py
    """
    levels_up = 3
    return Path(__file__).parents[levels_up]

ROOT = find_repo_root()
DATA_FOLDER = ROOT / 'data'
UNPACK_RAR_EXE = str(ROOT / 'bin' / 'UnRAR.exe')
XL_PATH = str(ROOT / 'output' / 'kep.xlsx')
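To make the parents indexing concrete, here is a tiny self-contained check (the path itself is made up for illustration):

```python
from pathlib import Path

# a hypothetical helper-file location, matching the docstring above
p = Path('/repo/src/kep/helper/locations.py')

print(p.parents[0])  # /repo/src/kep/helper  (the containing folder)
print(p.parents[3])  # /repo                 (three levels further up)
```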
This is similar to what you do with DATA_DIR. A possible weak point is that I manually hardcode the relative location of the helper file with respect to the project root; if the helper file is moved, this needs to be adjusted. But hey, this is the same way it is done in Django.
2. Allow access to specific data in the raw, interim and processed folders.
This can be a simple function returning a full path for a filename in a folder, for example:
def interim(filename):
    """Return path for *filename* in the 'data/interim' folder."""
    return str(ROOT / 'data' / 'interim' / filename)
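The same idea extends to all three data folders with one parametrised helper — a sketch, with the folder layout assumed from the tree above and the root location an assumption about where this file sits:

```python
from pathlib import Path

# assumption: this file lives one level below the repo root
ROOT = Path(__file__).resolve().parents[1]

def datapath(subfolder, filename):
    """Full path for *filename* in data/<subfolder> ('raw', 'interim' or 'processed')."""
    return str(ROOT / 'data' / subfolder / filename)
```

Usage is then `datapath('processed', 'my_data')` instead of three near-identical functions.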
In my project I have year-month subfolders inside the interim and processed directories, and I address data by year, month and sometimes frequency. For this data structure I have InterimCSV and ProcessedCSV classes that resolve the specific paths, like:
from .helper import ProcessedCSV, InterimCSV

# somewhere in code
csv_text = InterimCSV(self.year, self.month).text()

# later in code
path = ProcessedCSV(2018, 4).path(freq='q')
The code for the helper is here. Additionally, the classes create subfolders if they are not present (I want this for unit tests in a temporary directory), and there are methods for checking that files exist and for reading their contents.
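For a rough idea of the shape of such a class, here is a simplified sketch — not the real implementation; the root location and the file name `tab.csv` are assumptions for illustration:

```python
from pathlib import Path

ROOT = Path('.')  # in the real helper this is the repo root

class InterimCSV:
    """Path to an interim CSV addressed by year and month."""
    subfolder = 'interim'

    def __init__(self, year, month):
        self.folder = ROOT / 'data' / self.subfolder / str(year) / f'{month:02d}'
        # create subfolders if they are not present
        self.folder.mkdir(parents=True, exist_ok=True)
        self.path = self.folder / 'tab.csv'

    def exists(self):
        return self.path.exists()

    def text(self):
        return self.path.read_text(encoding='utf-8')
```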
In your example, you can easily fix the root directory in settings.py, but I think you can go a step further and abstract your data access. Currently data_sample() mixes file access and data transformation, which is not a great sign, and it also uses a global name, another bad sign for a function. I suggest you consider the following:
# keep this in settings.py
def processed(filename):
    return os.path.join(DATA_DIR, filename)

# this works on a dataframe - your argument is a dataframe,
# and you return a dataframe
def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    # FIXME: what is `code`?
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')

# make a small but elegant pipeline of data transformation
file_path = processed('my_data')
df0 = pd.read_parquet(file_path)
df = transform_sample(df0)
This works fine as long as you are not committing lots of data, and as long as you make clear the difference between snapshots of the uncontrolled outside world and your own derived data (code + raw == state). It is sometimes useful to keep raw append-only-ish and to think about symlinking steps like raw/interesting_source/2018.csv.gz -> raw_appendonly/interesting_source/2018.csv.gz.20180401T12:34:01 or some similar pattern to establish a "use latest" input structure. Try to clearly separate config settings (my_project/__init__.py, config.py, settings.py or whatever) that might need to change depending on the environment (imagine swapping out the filesystem for a blobstore or whatever). setup.py usually sits at the top level, my_project/setup.py, and anything related to runnable stuff (not docs; examples, not sure) goes in my_project/my_project. Define one _mydir = os.path.dirname(os.path.realpath(__file__)) in one place (config.py) and rely on that to avoid refactoring pain.