I have written a neural network classifier that takes in massive images (~1-3 GB apiece), splits them into patches, and passes the patches through the network individually. Training was going really slowly, so I benchmarked it and found that it was taking ~50 s to load the patches from one image into memory (using the OpenSlide library), and only ~0.5 s to pass them through the model.
However, I'm working on a supercomputer with 1.5 TB of RAM, of which only ~26 GB is being utilized. The dataset is ~500 GB in total. My thinking is that if we could load the entire dataset into memory, it would speed up training tremendously. But I am working with a research team and we are running experiments across multiple Python scripts. So ideally, I would like to load the entire dataset into memory in one script and be able to access it across all scripts.
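For context, the per-image loading step looks roughly like this (a sketch only - the function name, patch size, and preprocessing are placeholders, not our actual code):
import numpy as np
import openslide

def load_patches(image_path, region_list, level=2, patch_size=(224, 224)):
    """Read one patch per (x, y) coordinate from a whole-slide .tif with OpenSlide."""
    slide = openslide.OpenSlide(image_path)
    patches = []
    for x, y in region_list:
        # read_region returns an RGBA PIL image; drop the alpha channel
        region = slide.read_region((x, y), level, patch_size)
        patches.append(np.asarray(region)[:, :, :3])
    slide.close()
    return np.stack(patches)
This is the slow, disk-bound step; ideally we would pay for it only once and then share the result.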
More details: the images are in .tif format.
I have found many posts about how to share Python objects or raw data in memory across multiple Python scripts:
- Server Processes with SyncManager and BaseManager in the multiprocessing module | Example 1 | Example 2 | Docs - Server Processes | Docs - SyncManagers. However, the Manager object pickles objects before sending them, which could slow things down.
- The mmap module | Docs. However, mmap maps the file to virtual memory, not physical memory - it creates a temporary file.
- Pyro4 (client-server for Python objects) | Docs
- The sysv_ipc module for Python. This demo looks promising.
- The multiprocessing module?
I also found this list of options for IPC/networking in Python.
Some discuss server-client setups, some discuss serialization/deserialization, which I'm afraid will take longer than just reading from disk. None of the answers I've found address my question about whether these will result in a performance improvement on I/O.
Not only do we need to share Python objects/memory across scripts; we need to share them across Docker containers.
The Docker documentation explains the --ipc flag pretty well. What makes sense to me, according to the documentation, is running:
docker run -d --ipc=shareable data-server
docker run -d --ipc=container:data-server data-client
But when I run my client and server in separate containers with an --ipc connection set up as described above, they are unable to communicate with each other. The SO questions I've read (1, 2, 3, 4) don't address the integration of shared memory between Python scripts in separate Docker containers.
How should docker run --ipc=<mode> be set up for this? (Is a shared IPC namespace even the best way to share memory across Docker containers?)
This is my naive approach to memory sharing between Python scripts in separate containers. It works when the Python scripts are run in the same container, but not when they are run in separate containers.
server.py
from multiprocessing.managers import SyncManager
import multiprocessing

patch_dict = {}

image_level = 2
image_files = ['path/to/normal_042.tif']
region_list = [(14336, 10752),
               (9408, 18368),
               (8064, 25536),
               (16128, 14336)]

def load_patch_dict():
    for i, image_file in enumerate(image_files):
        # We would load the image files here. As a placeholder, we just add `1` to the dict
        patches = 1
        patch_dict.update({'image_{}'.format(i): patches})

def get_patch_dict():
    return patch_dict

class MyManager(SyncManager):
    pass

if __name__ == "__main__":
    load_patch_dict()
    port_num = 4343
    MyManager.register("patch_dict", get_patch_dict)
    manager = MyManager(("127.0.0.1", port_num), authkey=b"password")
    # Set the authkey because it doesn't set properly when we initialize MyManager
    multiprocessing.current_process().authkey = b"password"
    manager.start()
    input("Press any key to kill server".center(50, "-"))
    manager.shutdown()
client.py
from multiprocessing.managers import SyncManager
import multiprocessing
import sys, time

class MyManager(SyncManager):
    pass

MyManager.register("patch_dict")

if __name__ == "__main__":
    port_num = 4343
    manager = MyManager(("127.0.0.1", port_num), authkey=b"password")
    multiprocessing.current_process().authkey = b"password"
    manager.connect()
    patch_dict = manager.patch_dict()

    keys = list(patch_dict.keys())
    for key in keys:
        image_patches = patch_dict.get(key)
        # Do NN stuff (irrelevant)
These scripts work fine for sharing the images when the scripts are run in the same container. But when they are run in separate containers, like this:
# Run the container for the server
docker run -it --name cancer-1 --rm --cpus=10 --ipc=shareable cancer-env
# Run the container for the client
docker run -it --name cancer-2 --rm --cpus=10 --ipc=container:cancer-1 cancer-env
I get the following error:
Traceback (most recent call last):
  File "patch_client.py", line 22, in <module>
    manager.connect()
  File "/usr/lib/python3.5/multiprocessing/managers.py", line 455, in connect
    conn = Client(self._address, authkey=self._authkey)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
Docker containers are allocated 64 MB of shared memory by default.
The ipcMode parameter allows you to configure your containers to share their inter-process communication (IPC) namespace with the other containers in the task, or with the host. The IPC namespace allows containers to communicate directly through shared-memory with other containers running in the same task or host.
Docker containers all run on the same kernel, whereas a VM runs a kernel per guest. So, in terms of which kernel resources are shared: essentially everything, except those items that are namespaced away from each other (non-shared mounts, process tree entries, cgroups, etc.).
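For what it's worth, when two containers do share an IPC namespace (--ipc=shareable / --ipc=container:...), the simplest way to use that from Python may be plain files under /dev/shm, which Docker's documentation indicates is shared between the containers in that mode; the 64 MB default mentioned above applies to /dev/shm and can be raised with --shm-size. A minimal sketch, with a made-up file name, shape, and dtype:
# In the data-server container: write the patches into shared memory
import numpy as np
patches = np.ones((4, 224, 224, 3), dtype=np.uint8)    # placeholder for real OpenSlide output
patches.tofile('/dev/shm/normal_042.patches')           # lives in RAM, not on disk

# In the data-client container (same IPC namespace, hence same /dev/shm)
import numpy as np
patches = np.fromfile('/dev/shm/normal_042.patches', dtype=np.uint8).reshape(4, 224, 224, 3)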
I recommend you try using tmpfs.
It is a Linux feature allowing you to create a virtual filesystem that is stored entirely in RAM. This allows very fast file access and takes as little as one bash command to set up.
In addition to being very fast and straightforward, it has many advantages in your case:
- All you have to do is cp the dataset into the tmpfs; your scripts keep reading ordinary files.
- If RAM runs short, tmpfs can adapt and swap pages to the hard drive. And if you ever have to run this on a server with no free RAM, you could just keep all your files on the hard drive with a normal filesystem and not touch your code at all.
Steps to use:
sudo mount -t tmpfs -o size=600G tmpfs /mnt/mytmpfs
cp -r dataset /mnt/mytmpfs
ramfs might be faster than tmpfs in some cases, as it doesn't implement page swapping. To use it, just replace tmpfs with ramfs in the instructions above.
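Once the dataset is copied in, the training scripts read it exactly as they do now; the only change is the root path. A sketch, assuming the mount point above and a made-up patch size:
import openslide

# Same OpenSlide code as before; the file is now served from RAM by tmpfs
slide = openslide.OpenSlide('/mnt/mytmpfs/dataset/normal_042.tif')
patch = slide.read_region((14336, 10752), 2, (224, 224))
slide.close()
In the Docker setup, the mount point would also have to be bind-mounted into each container (e.g. with -v /mnt/mytmpfs:/mnt/mytmpfs).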
I think a shared memory or mmap solution is the right approach.
Shared memory:
First read the dataset into memory in a server process. For Python, use the multiprocessing wrappers to create objects in shared memory between processes, such as multiprocessing.Value or multiprocessing.Array, then create a Process and pass the shared object as an argument.
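A minimal sketch of that idea, using multiprocessing.RawArray (the lock-free variant of multiprocessing.Array) and a made-up patch shape; real code would fill the array from OpenSlide instead of with ones:
import ctypes
import multiprocessing as mp
import numpy as np

def worker(shared_arr, shape):
    # Re-wrap the shared buffer as a numpy array; no copy is made
    patches = np.frombuffer(shared_arr, dtype=np.uint8).reshape(shape)
    print(patches.mean())        # stand-in for the NN forward pass

if __name__ == "__main__":
    shape = (4, 224, 224, 3)     # 4 patches per image, size made up
    shared_arr = mp.RawArray(ctypes.c_uint8, int(np.prod(shape)))
    patches = np.frombuffer(shared_arr, dtype=np.uint8).reshape(shape)
    patches[:] = 1               # placeholder for the real patch-loading step
    p = mp.Process(target=worker, args=(shared_arr, shape))
    p.start()
    p.join()
Note that this shares memory between a parent process and the processes it spawns, so on its own it does not cross container boundaries.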
mmap:
Store the dataset in a file on the host, then mount that file into each container. If one container opens the file and maps it into its virtual memory, the other containers will not need to read the file from disk when they open it, because the file is already in physical memory (the page cache).
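A sketch of the mmap variant, using numpy.memmap over a raw binary file of patches; the path, shape, and dtype are hypothetical, and the file would have to be bind-mounted into each container (e.g. docker run -v /data:/data ...):
import numpy as np

shape = (4, 224, 224, 3)   # must match whatever wrote the file

# numpy.memmap wraps mmap: pages are pulled into the kernel page cache on first
# access, so a second container reading the same file gets them from RAM, not disk
patches = np.memmap('/data/patches.raw', dtype=np.uint8, mode='r', shape=shape)
batch = np.array(patches[:2])   # copy out only the patches actually needed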
P.S. I am not sure how CPython implements large shared memory between processes; CPython's shared memory probably uses mmap internally.