We have about 10 Python processes running on a Linux box, all reading the same large data structure (which happens to be a Pandas DataFrame, essentially a 2D NumPy matrix).
These processes must respond to queries as quickly as possible, and keeping the data on disk is simply not fast enough for our needs anymore.
What we really need is for all the processes to have full random access to the data-structure in memory, so they can retrieve all elements necessary to perform their arbitrary calculations.
We cannot duplicate the data-structure 10 times (or even twice) in-memory due to its size.
Is there a way all 10 Python processes can share random access to the data-structure in memory?
Because Linux supports Copy-on-Write (COW) on fork(), data is not copied unless it is written to. Therefore, if you define the DataFrame, df, in the global namespace, then you can access it from as many subsequently spawned subprocesses as you wish, and no extra memory for the DataFrame is required.
Only if one of the subprocesses modifies df (or data on the same memory page as df) is the data (on that memory page) copied.
So, as strange as it may sound, you don't have to do anything special on Linux to share access to a large in-memory data structure among subprocesses except define the data in the global namespace before spawning the subprocesses.
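As a rough sketch of that pattern, the snippet below defines a module-level buffer and reads it from forked children. The names `DATA`, `worker`, and `run_queries` are made up for illustration, and `array.array` is only a stand-in for the real DataFrame so the example needs nothing beyond the standard library. It asks for the "fork" start method explicitly, since the default start method is not "fork" on every platform or Python version.

```python
import multiprocessing as mp
from array import array

# Stand-in for the large DataFrame: defined at module level (global namespace)
# BEFORE any child process is created. In the real setup this would be `df`.
DATA = array("d", range(1_000_000))


def worker(conn, start, stop):
    """Runs in the child: read-only access to the inherited global DATA."""
    conn.send(sum(DATA[start:stop]))
    conn.close()


def run_queries(ranges):
    """Fork one child per (start, stop) range and collect the sums in order."""
    # Explicit "fork" context so children inherit DATA via copy-on-write
    # instead of re-importing the module or receiving a pickled copy.
    ctx = mp.get_context("fork")
    jobs = []
    for start, stop in ranges:
        reader, writer = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=worker, args=(writer, start, stop))
        proc.start()
        writer.close()  # the parent keeps only the read end
        jobs.append((proc, reader))

    results = []
    for proc, reader in jobs:
        results.append(reader.recv())
        proc.join()
    return results


if __name__ == "__main__":
    print(run_queries([(0, 10), (10, 20)]))
```

Only the small query results travel through the pipes; the buffer itself is never sent anywhere. One caveat: CPython updates reference counts on the objects it touches, so a few pages holding object headers may still be copied in each child, while the bulk numeric buffer is left alone by plain reads.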
Here is some code demonstrating Copy-on-Write behavior.
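A minimal sketch of my own (not the code originally linked) that shows the visible effect: a forked child writes to inherited data, and the parent's copy stays unchanged. `DATA`, `child_write`, and `demo` are hypothetical names, and `array.array` again stands in for the DataFrame. This only demonstrates the isolation that copy-on-write provides; it does not measure how many pages were actually copied.

```python
import multiprocessing as mp
from array import array

# Stand-in for the shared DataFrame, created before forking.
DATA = array("d", [0.0] * 100_000)


def child_write(conn):
    """Runs in the child: the write makes the kernel copy the affected page
    for this process only, so the parent's page is left untouched."""
    DATA[0] = 42.0
    conn.send(DATA[0])
    conn.close()


def demo():
    """Return (value seen by the child after its write, value in the parent)."""
    ctx = mp.get_context("fork")  # assumes a platform with fork(), e.g. Linux
    reader, writer = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=child_write, args=(writer,))
    proc.start()
    writer.close()
    child_value = reader.recv()
    proc.join()
    return child_value, DATA[0]


if __name__ == "__main__":
    print(demo())
```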
When data gets modified, the memory page on which it resides gets copied. As described in this PDF:
Each process has a page table which maps its virtual addresses to physical addresses; when the fork() operation is performed, the new process has a new page table created in which each entry is marked with a "copy-on-write" flag; this is also done for the caller's address space. When the contents of memory are to be updated, the flag is checked. If it is set, a new page is allocated, the data from the old page copied, the update is made on the new page, and the "copy-on-write" flag is cleared for the new page.
Thus, if there is an update to some value on a memory page, that page is copied. If part of a large DataFrame resides on that memory page, then only that part gets copied, not the whole DataFrame. By default the page size is usually 4 KiB but can be larger depending on how the MMU is configured.
Type
% getconf PAGE_SIZE
4096
to find the page size (in bytes) on your system.
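The same value can be read from inside Python, which makes it easy to estimate how many pages a given write would dirty. This is just a sketch: `page_size` and `pages_touched` are hypothetical helpers, and the offset is assumed to be measured from a page-aligned start of the buffer, which real allocations do not guarantee.

```python
import os


def page_size():
    """Page size in bytes, as reported by the OS (same value as getconf PAGE_SIZE)."""
    return os.sysconf("SC_PAGE_SIZE")


def pages_touched(offset, nbytes, page=None):
    """Number of memory pages covered by a write of `nbytes` bytes starting
    at byte `offset`, assuming offset 0 sits on a page boundary."""
    if nbytes <= 0:
        return 0
    if page is None:
        page = page_size()
    first = offset // page
    last = (offset + nbytes - 1) // page
    return last - first + 1


if __name__ == "__main__":
    print(page_size())
    # An 8-byte float64 written 4 bytes before a 4096-byte page boundary
    # straddles two pages, so both would be copied.
    print(pages_touched(4092, 8, page=4096))
```

With 4 KiB pages, changing a single float64 in a child therefore costs at most two copied pages (8 KiB), no matter how large the DataFrame is.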