I'm using mpi4py to reduce a numpy array element-wise across multiple processes. The idea is that the arrays get summed element-wise, so that if I have two processes, each with an array:
Rank 0: [1, 1, 1]
Rank 1: [2, 3, 4]
after reduction I should have
[3, 4, 5]
This case, with such short arrays, works fine.
However, in my real use case these arrays are quite long (array_length in my example code below). I have no problems if I send numpy arrays of length less than or equal to 505 elements, but above that, I get the following output:
[83621b291fb8:01112] Read -1, expected 4048, errno = 1
and I've been unable to find any documented reason why this might be. Interestingly, however, 506*8 = 4048, which (assuming some header data) makes me suspect I'm hitting a 4 kB buffer limit somewhere inside mpi4py or MPI itself.
I've managed to work around this problem by breaking down the numpy array I want to element-wise reduce into chunks of size 200 (just an arbitrary number less than 505), and calling Reduce() on each chunk, then reassembling on the master process. However, this is somewhat slow.
Does anyone know if this is indeed due to a 4kb buffer limit (or similar) in mpi4py/MPI?
Is there a better solution than slicing the array into pieces and making many calls to Reduce(), as I am currently doing? This seems a bit slow to run.
Below is code that illustrates this (note the case and use_slices booleans):
- With case=0 and use_slices=False, the error can be seen (array length 506)
- With case=1 and use_slices=False, the error vanishes (array length 505)
- With use_slices=True, the error vanishes regardless of case, even if case is set to a very long array (case=2)
import mpi4py, mpi4py.MPI
import numpy as np

###### CASE FLAGS ########
# Whether or not to break the array into 200-element pieces
# before calling MPI Reduce()
use_slices = False

# The total length of the array to be reduced:
case = 0
if case == 0:
    array_length = 506
elif case == 1:
    array_length = 505
elif case == 2:
    array_length = 1000000

comm = mpi4py.MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

array_to_reduce = np.ones(array_length)*(rank+1)  # just some different numbers per rank
reduced_array = np.zeros(array_length)

if not use_slices:
    # Reduce the whole array in a single call
    comm.Reduce(array_to_reduce,
                reduced_array,
                op=mpi4py.MPI.SUM,
                root=0)

    if rank == 0:
        print(reduced_array)

else:  # in this case, use_slices is True
    # Reduce the array in 200-element slices and reassemble on the root
    array_slice_length = 200
    sliced_array = np.array_split(array_to_reduce,
                                  range(array_slice_length, array_length, array_slice_length))

    reduced_array_using_slices = np.array([])
    for array_slice in sliced_array:
        returnedval = np.zeros(shape=array_slice.shape)
        comm.Reduce(array_slice,
                    returnedval,
                    op=mpi4py.MPI.SUM,
                    root=0)
        reduced_array_using_slices = np.concatenate((reduced_array_using_slices, returnedval))
        comm.Barrier()

    if rank == 0:
        print(reduced_array_using_slices)
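For reference, a typical way to launch the script above with two ranks on one node (reduce_test.py is just a placeholder name for the file containing the code):

mpirun -np 2 python reduce_test.py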
Compiled from source -
openmpi 3.1.4
mpi4py 3.0.3
This is not a problem with mpi4py per se. The issue comes from the Cross-Memory Attach (CMA) system calls process_vm_readv() and process_vm_writev() that the shared-memory BTLs (Byte Transfer Layers, a.k.a. the things that move bytes between ranks) of Open MPI use to accelerate shared-memory communication between ranks that run on the same node, by avoiding copying the data twice to and from a shared-memory buffer. This mechanism involves some setup overhead and is therefore only used for larger messages, which is why the problem only starts occurring after the message size crosses the eager threshold.
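If you want to see the threshold on your own installation, ompi_info can print the relevant BTL parameters. The following is just an inspection sketch, assuming the vader BTL is the one in use; on many builds the eager limit defaults to 4 kB, which lines up with the roughly 4 kB cut-off observed in the question:

# Show the vader BTL parameters, including the eager limit above which
# the single-copy (CMA) path is used
ompi_info --param btl vader --level 9 | grep eager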
CMA is part of the ptrace family of kernel services. Docker uses seccomp to limit what system calls can be made by processes running inside the container. The default profile has the following:
{
    "names": [
        "kcmp",
        "process_vm_readv",
        "process_vm_writev",
        "ptrace"
    ],
    "action": "SCMP_ACT_ALLOW",
    "args": [],
    "comment": "",
    "includes": {
        "caps": [
            "CAP_SYS_PTRACE"
        ]
    },
    "excludes": {}
},
limiting the ptrace-related syscalls to containers that have the CAP_SYS_PTRACE capability, which is not among the capabilities granted by default. Therefore, to enable the normal functioning of Open MPI in Docker, one needs to grant the required capability by calling docker run with the following additional option:

--cap-add=SYS_PTRACE
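For example, a full invocation might look like the following; the image name and the command run inside the container are placeholders:

# Grant the ptrace capability so Open MPI's CMA path works inside the container
docker run --cap-add=SYS_PTRACE my-mpi-image \
    mpirun -np 2 python reduce_test.py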
This will allow Open MPI to function properly, but enabling ptrace may present a security risk in certain container deployments. Therefore, an alternative is to disable the use of CMA by Open MPI. This is achieved by setting an MCA parameter depending on the version of Open MPI and the shared-memory BTL used:
- sm BTL (default before Open MPI 1.8): --mca btl_sm_use_cma 0
- vader BTL (default since Open MPI 1.8): --mca btl_vader_single_copy_mechanism none
Disabling the single-copy mechanism will force the BTL to use pipelined copy through a shared-memory buffer, which may or may not affect the run time of the MPI job.
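For instance, with the vader BTL the parameter can be given on the mpirun command line or through the corresponding OMPI_MCA_ environment variable (the script name is again a placeholder):

# Pass the MCA parameter directly to mpirun
mpirun --mca btl_vader_single_copy_mechanism none -np 2 python reduce_test.py

# ...or set it in the environment before launching
export OMPI_MCA_btl_vader_single_copy_mechanism=none
mpirun -np 2 python reduce_test.py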
More background on the shared-memory BTLs and the single-copy mechanisms in Open MPI can be found in the Open MPI FAQ.