 

Structuring ocean modelling code in python

Tags:

python

I am starting to use python for numerical simulation, and in particular I am starting from this project to build my own, which will be more complicated since I will have to try a lot of different methods and configurations. I work full time on Fortran90 and Matlab codes, and those are the two languages that are my "mother tongue". In those two languages one is free to structure the code as one wants, and I am trying to mimic this freedom because in my field (computational oceanography) things get rather complicated easily. See, as an example, the code I work with daily, NEMO (here the main page, here the source code). The source code of NEMO is conveniently divided into folders, each of which contains modules and methods for a specific task (e.g. the domain discretisation routines are in the folder DOM, the vertical physics in ZDF, the lateral physics in LDF, and so on), because the processes (physical or purely mathematical) involved are completely different. What I am trying to build is this:

  • /shallow_water_model -
    • create_conf.py (creates a new subdirectory in /cfgs with a given name, like "caspian_sea" or "mediterranean_sea" and copies the content of the folder /src inside this new subdirectory to create a new configuration)
    • /cfgs -
      • /caspian_sea (example configuration)
      • /mediterranean_sea (example configuration)
    • /src -
      • swm_main.py (initializes a dictionary and calls the functions)
      • swm_param.py (fills the dictionary)
      • /domain -
        • swm_grid.py (creates a numerical grid)
      • /dynamics -
        • swm_adv.py (create advection matrix)
        • swm_dif.py (create diffusion matrix)
      • /solver -
        • swm_rk4.py (time stepping with Runge-Kutta4)
        • swm_pc.py (time stepping with predictor corrector)
      • /IO -
        • swm_input.py (handles netCDF input)
        • sim_output.py (handles netCDF output)

The script create_conf.py has the following structure: it is supposed to take a string from the terminal, create a folder with that name, and copy all the files and subdirectories of the /src folder inside it, so one can put all the input files of that configuration there and possibly modify the source code to create an ad-hoc version for the configuration. This duplication of the source code is common in the ocean modelling community, because two different configurations (like the Mediterranean Sea and the Caspian Sea) may differ not only in the input files (topography, coastlines, etc.) but also in the modelling itself, meaning that the modifications you need to make to the source code for each configuration might be substantial. (Most ocean models allow you to put your own modified source files in specific folders and are instructed to overwrite the corresponding files at compilation. My code is going to be simple enough to just duplicate the source code.)

import os, sys
import shutil

def create_conf(conf_name="new_config"):
    cfg_dir = os.path.join(os.getcwd(), "cfgs")
    conf_dir = os.path.join(cfg_dir, conf_name)
    # Check if the configuration already exists
    try:
        os.makedirs(conf_dir)
        print("Configuration " + conf_name + " correctly created")
    except FileExistsError:
        # directory already exists:
        # handle overwriting, duplicating or stopping here
        return
    # make a copy of "/src" into the new folder (dirs_exist_ok needs Python >= 3.8)
    shutil.copytree(os.path.join(os.getcwd(), "src"), conf_dir, dirs_exist_ok=True)
    return

# This is supposed to be used directly from the terminal
if __name__ == '__main__':
    filename = sys.argv[1]
    create_conf(filename)
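
With this in place, running the script from the model root with e.g. python create_conf.py caspian_sea should create /cfgs/caspian_sea as a copy of /src (assuming create_conf.py sits next to the /cfgs and /src folders).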

The script swm_main.py can be thought of as a list of calls to the necessary routines, depending on the kind of processes you want to take into account, just like this:

import numpy as np
from DOM.swm_domain import set_grid
from swm_param import set_param, set_timestep, set_viscosity
# initialize dictionary (i.e. structure) containing all the parameters of the run 
global param
param = dict()
# define the parameters (i.e. call swm_param.py)
set_param(param)
# Create the grid
set_grid(param)

The two routines called just take a particular field of param and assign it a value, like

import numpy as np
import os
def set_param(param):
    param['nx'] = 32               # number of grid points in x-direction
    param['ny'] = 32               # number of grid points in y-direction  
    return param

Now, the main topic of discussion is how to achieve this kind of structure in python. I almost always find source code that is either monolithic (all routines in the same file) or a sequence of files in the same folder. I want to have better organisation, but the solution I found by browsing fills every subfolder of /src with a __pycache__ folder and requires me to put an __init__.py file in each folder. I don't know why, but these two things make me think there is something sloppy in this approach. Moreover, I need to import modules (like numpy) in every file, and I was wondering whether this is efficient or not.

What do you think would be the best way to achieve this kind of structure while keeping it as simple as possible? Thanks for your help.

Asked by Kimala, Feb 11 '26 12:02


1 Answer

As I understand it, the actual question here is:

the solution I found by browsing fills every subfolder of /src with a __pycache__ folder and requires me to put an __init__.py file in each folder... these two things make me think there is something sloppy in this approach.

There is nothing sloppy or unpythonic about making your code into packages. In order to be able to import from .py files in a directory, one of two conditions has to be satisfied:

  • the directory must be in your sys.path, or
  • the directory must be a package, and that package must be a sub-directory of some directory in your sys.path (or a sub-directory of a package which is a sub-directory of some directory in your sys.path)

The first solution is generally hacky in code, although often appropriate in tests, and involves modifying sys.path to add every dir you want. It is hacky because the whole point of putting your code inside a package is that the package structure encodes some natural division in the source: e.g. a package modeller is conceptually distinct from a package quickgui, and each could be used independently of the other in different programs.
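
For completeness, the sys.path route looks something like this (a minimal sketch with a hypothetical layout, and as said above it is usually best confined to test code):

import os, sys

# prepend the model's /src directory so its modules can be imported directly
# (the relative path here is hypothetical: adjust it to wherever this file lives)
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))

import swm_param   # now resolvable even though /src is not a package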

The easiest[1] way to make a directory into a package is to place an __init__.py in it. The file should contain anything which belongs conceptually at the package level, i.e. not in modules. It may be appropriate to leave it empty, but it's often a good idea to import the public functions/classes/vars from your modules, so you can do from mypkg import thing rather than from mypkg.module import thing. Packages should be conceptually complete, which normally means you should be able (in theory) to use them from multiple places. Sometimes you don't want a separate package: you just want a naming convention, like gui_tools.py, gui_constants.py, model_tools.py, model_constants.py, etc.

The __pycache__ folder is simply python caching the bytecode to make future imports faster: you can move it or prevent it, but the simplest thing is to add __pycache__ to your .gitignore and forget about it.
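
As a minimal sketch of that idea applied to the layout in the question (assuming, as the layout suggests, that domain/swm_grid.py is where set_grid lives; the exact names don't matter):

# src/domain/__init__.py
from .swm_grid import set_grid    # re-export the public function at package level

# src/swm_main.py can then simply do
from domain import set_grid       # rather than: from domain.swm_grid import set_grid

Here swm_main.py sits next to the domain/ folder, so when it is run as a script its own directory is already on sys.path and domain resolves as a package; each sub-folder (domain, dynamics, solver, IO) just needs its own __init__.py.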

Lastly, since you come from very different languages:

  • lots of python code written by scientists (rather than programmers) is quite unpythonic IMHO. Billion-line-long single python files are not good style[2]. Python prefers readability, always: call things derived_model not dm1. If you do that you may well find you don't need as many dirs as you thought.
  • importing the same module in every file is a trivial cost: python imports once; every other import is just another name bound in sys.modules (see the small sketch after this list). Always import explicitly.
  • in general stop worrying about performance in python. Write your code as clearly as possible, then profile it if you need to, and find what is slow. Python is so high level that micro-optimisations learned in compiled languages will probably backfire.
  • lastly, and this is mostly personal, don't give folders/modules names in CAPITALS. FORTRAN might encourage that, and it was written on machines which often didn't have case sensitivity for filenames, but we no longer have those constraints. In python we reserve capitals for constants, so I find it plain weird when I have to modify or execute something in capitals. Likewise 'DOM' made me think of the document object model which is probably not what you mean here.
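
To see that repeated imports are essentially free, here is a small sketch (using numpy only because the question already imports it):

import sys
import numpy as np      # first import: numpy is loaded and cached in sys.modules
import numpy as np2     # "importing again": just binds another name to the cached module

print(np is np2)                     # True: the very same module object
print(sys.modules["numpy"] is np)    # True: later imports are just dictionary lookups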

References

[1] Python does have implicit namespace packages but you are still better off with explicit packages to signal your intention to make a package (and to avoid various importing problems).

[2] See pep8 for some more conventions on how you structure things. I would also recommend looking at some decent general-purpose libraries to see how they do things: they tend to be written by mainstream programmers who focus on writing clean, maintainable code, rather than by scientists who focus on solving highly specific (and frequently very complicated) problems.

Answered by 2e0byo, Feb 13 '26 09:02


