I have a pipeline which at some point splits work into several sub-processes that do the same thing in parallel, so their output should go into the same file.
Is it too risky to have all of those processes write to the same file? Or does Python retry if it sees that the resource is occupied?
No, generally it is not safe to do this! You need to obtain an exclusive write lock for each process, which means every other process has to wait while one process is writing to the file. The more I/O-intensive processes you have, the longer the wait time.
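As an illustration, here is a minimal sketch of that locking approach using the third-party filelock package; the output path, lock-file name, and helper function are placeholders I've assumed, not anything from the question:

```python
# Minimal sketch of per-process exclusive locking with the third-party
# "filelock" package (pip install filelock). Paths and the helper are
# assumed placeholders.
from filelock import FileLock

OUTPUT_PATH = "combined_output.txt"   # assumed shared output file
LOCK_PATH = OUTPUT_PATH + ".lock"     # lock file coordinating the writers

def append_result(result: str) -> None:
    # Block until no other process holds the lock, then append and release.
    with FileLock(LOCK_PATH):
        with open(OUTPUT_PATH, "a") as f:
            f.write(result + "\n")
```

Each process calls append_result for its own output; the writes are serialized, which is exactly the waiting cost described above.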
Nothing happens. Once the script is loaded into memory and running, it will keep running as it is. An "auto-reloading" feature can still be implemented in your own code, as Flask and other frameworks do.
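A rough sketch of such a hand-rolled reloader, assuming simple mtime polling (the function name and polling interval are my own placeholders; Flask's reloader is more elaborate):

```python
# Poll this script's mtime and re-exec the interpreter when the file changes.
# Run this loop in a background thread or as the program's outer loop.
import os
import sys
import time

def watch_and_reload(poll_seconds=1.0):
    script = os.path.abspath(sys.argv[0])
    last_mtime = os.path.getmtime(script)
    while True:
        time.sleep(poll_seconds)
        if os.path.getmtime(script) != last_mtime:
            # Replace the current process with a fresh interpreter.
            os.execv(sys.executable, [sys.executable] + sys.argv)
```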
You can create a fourth Python file, d.py, in the same folder as the other three Python files; it imports the other three modules and runs their functions, as shown below.
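A sketch of what d.py could look like, assuming the other files are a.py, b.py, and c.py and that each exposes a main() function (those names are assumptions):

```python
# d.py -- runs the other three modules' entry points in sequence.
# a, b, c and their main() functions are assumed names; adjust to the real files.
import a
import b
import c

if __name__ == "__main__":
    a.main()
    b.main()
    c.main()
```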
This is system-dependent. On Windows, the resource is locked and you get an exception. On Linux, two processes can write to the file at the same time, but the written data may end up interleaved.
Ideally, in such cases you should use semaphores to synchronize access to the shared resource (see the sketch below).
If using semaphores is too heavy for your needs, then the only alternative is to write to separate files...
Edit: As pointed out by eye in a later post, a resource manager is another alternative for handling concurrent writers.
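A minimal sketch of the semaphore approach using the standard library's multiprocessing module; the output path and the per-worker payload are made up for illustration:

```python
# A binary semaphore guards the shared file so only one process writes at a time.
import multiprocessing as mp

def writer(sem, path, worker_id):
    line = f"result from worker {worker_id}\n"   # stand-in for real output
    with sem:                                    # at most one writer at a time
        with open(path, "a") as f:
            f.write(line)

if __name__ == "__main__":
    sem = mp.Semaphore(1)          # binary semaphore guarding the file
    path = "shared_output.txt"     # assumed shared output file
    procs = [mp.Process(target=writer, args=(sem, path, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```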
In general, this is not a good idea and takes a lot of care to get right. Since the writes have to be serialized, it can also hurt scalability.
I'd recommend writing to separate files and merging them afterwards (or just leaving them as separate files), as in the sketch below.
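For instance, a sketch of the separate-files-then-merge pattern; the file names, chunking, and per-item work are placeholder assumptions:

```python
# Each worker writes to its own file; the parent merges them at the end.
import multiprocessing as mp

def worker(args):
    worker_id, items = args
    part_path = f"part_{worker_id}.txt"      # one private file per worker
    with open(part_path, "w") as f:
        for item in items:
            f.write(f"processed {item}\n")   # stand-in for the real work
    return part_path

if __name__ == "__main__":
    chunks = [(i, range(i * 10, (i + 1) * 10)) for i in range(4)]
    with mp.Pool(4) as pool:
        part_files = pool.map(worker, chunks)

    # Merge the per-worker files into one result after all workers finish.
    with open("merged_output.txt", "w") as out:
        for path in part_files:
            with open(path) as part:
                out.write(part.read())
```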