 

Serverless concurrent write access in Python

Are there any packages in Python that support concurrent writes on NFS using a serverless architecture?

I work in an environment with a supercomputer where multiple jobs save their data in parallel. While I can save the results of these computations in separate files and combine them later, that requires me to write a reader that is aware of the specific way in which I split the computation across jobs, so that it knows how to stitch everything back into the final data structure correctly.

Last time I checked, SQLite did not support concurrent access over NFS. Are there any alternatives to SQLite?

Note: By serverless I mean avoiding having to explicitly start another server (on top of NFS) that handles the IO requests. I understand that NFS itself uses a client-server architecture, but that filesystem is already part of the supercomputer I use, and I do not have to maintain it myself. What I am looking for is a package or file format that supports concurrent IO without requiring me to set up any (additional) servers.

Example:

Here is an example of two jobs that I would run in parallel:

  • Job 1 populates my_dict from scratch with the following data, and saves it to the file:

    my_dict['a']['foo'] = [0.2, 0.3, 0.4]

  • Job 2 also populates my_dict from scratch with the following data, and saves it to the file:

    my_dict['a']['bar'] = [0.1, 0.2]

I want to later load the file and see the following in my_dict:

> my_dict['a'].items()
[('foo', [0.2, 0.3, 0.4]), ('bar', [0.1, 0.2])]

Note that the stitching operation should be automatic. In this particular case I chose to split the keys of my_dict['a'] across the computations, but other splits are possible. The fundamental idea is that there are no clashes between jobs: it is implicitly assumed that jobs only add/aggregate data, so fusing the dictionaries (or dataframes, if using Pandas) always aggregates the data, i.e. computes an "outer join" of the data.
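
To make the intended fusion concrete, here is a minimal sketch of the kind of hand-written reader I would like to avoid: each job pickles its own nested dict, and the reader computes the recursive union. The file names and the choice of pickle are purely illustrative.

    import pickle

    def merge(dst, src):
        # Recursively fold src into dst; by assumption the leaves never clash.
        for key, value in src.items():
            if isinstance(value, dict):
                merge(dst.setdefault(key, {}), value)
            else:
                dst[key] = value
        return dst

    my_dict = {}
    for path in ["job1.pkl", "job2.pkl"]:  # one file per job
        with open(path, "rb") as f:
            merge(my_dict, pickle.load(f))

    print(list(my_dict['a'].items()))
    # [('foo', [0.2, 0.3, 0.4]), ('bar', [0.1, 0.2])]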

asked Nov 25 '13 by Josh

1 Answer

Simple DIY, potentially flaky

Hierarchical locking -- i.e. you lock / first, then lock /foo and unlock /, then lock /foo/bar and unlock /foo. Make changes to /foo/bar and unlock it.

This allows other processes access to other paths. Lock contention on / is relatively small.
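
A minimal sketch of that scheme, using lock directories because directory creation is a single create-or-fail operation that NFS servers generally handle atomically. All paths are illustrative, stale locks left by crashed jobs are not cleaned up, and NFS attribute caching can add latency; treat this as a starting point rather than a hardened implementation.

    import errno
    import os
    import time
    from contextlib import contextmanager

    def acquire(lock_dir, poll=0.1):
        # Spin until the lock directory can be created; os.mkdir either
        # succeeds or fails with EEXIST, so only one process wins.
        while True:
            try:
                os.mkdir(lock_dir)
                return
            except OSError as e:
                if e.errno != errno.EEXIST:
                    raise
                time.sleep(poll)

    def release(lock_dir):
        os.rmdir(lock_dir)

    @contextmanager
    def hierarchical_lock(root, *parts):
        # Hand-over-hand locking: hold the parent only long enough to
        # take the child, e.g. lock /, then /foo, then /foo/bar.
        held = os.path.join(root, ".lock")
        acquire(held)
        path = root
        try:
            for part in parts:
                path = os.path.join(path, part)
                os.makedirs(path, exist_ok=True)  # safe: parent lock is held
                child = os.path.join(path, ".lock")
                acquire(child)
                release(held)
                held = child
            yield path  # caller modifies files under this path
        finally:
            release(held)

    # Job 1 would write under /shared/results/a/foo, Job 2 under .../a/bar.
    with hierarchical_lock("/shared/results", "a", "foo") as leaf:
        with open(os.path.join(leaf, "data.json"), "w") as f:
            f.write("[0.2, 0.3, 0.4]")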

Complicated DIY

Adapt a lock-free or wait-free algorithm, e.g. RCU. Pointers become symlinks or files containing lists of other paths.

http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html
https://dank.qemfd.net/dankwiki/index.php/Lock-free_algorithms
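
A minimal sketch of the "pointers become symlinks" idea, assuming readers can tolerate slightly stale data. It shows only the atomic publication step (write a new immutable version file, then swap a symlink via rename); the deferred reclamation of old versions, which is the part RCU is really about, is left out, and the paths are illustrative.

    import json
    import os
    import tempfile

    def publish(link_path, data):
        # Write a brand-new version file, then atomically repoint the
        # symlink at it; rename replaces the old link in one step, so
        # readers see either the old version or the new one, never a mix.
        directory = os.path.dirname(link_path) or "."
        fd, version_path = tempfile.mkstemp(dir=directory, suffix=".json")
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        tmp_link = version_path + ".lnk"
        os.symlink(os.path.basename(version_path), tmp_link)
        os.rename(tmp_link, link_path)
        # Old version files accumulate; reclaiming them safely is the
        # hard (grace-period) part of RCU and is omitted here.

    def read(link_path):
        # Readers never block writers: they simply follow the current link.
        with open(link_path) as f:
            return json.load(f)

    publish("/shared/results/current.json", {"a": {"foo": [0.2, 0.3, 0.4]}})
    print(read("/shared/results/current.json"))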

answered Oct 08 '22 by Dima Tisnek