 

Possible buffer size limit in mpi4py Reduce()

The Setup

I'm using mpi4py to reduce a numpy array element-wise across multiple processes. The idea is that the arrays get summed element-wise, so that if I have two processes with the arrays:

Rank 0: [1, 1, 1]
Rank 1: [2, 3, 4]

after reduction I should have

[3, 4, 5]

This case, with such short arrays, works fine.

The Problem

However, in my real use-case these arrays are quite long (array_length in my example code below). I have no problems if I send numpy arrays of length less than or equal to 505 elements, but above that, I get the following output:

[83621b291fb8:01112] Read -1, expected 4048, errno = 1

and I've been unable to find any documented reason why this might be. Interestingly, however, 506 * 8 = 4048, which - assuming some header data - makes me suspect I'm hitting a 4 KiB buffer limit somewhere inside mpi4py or MPI itself.

One Possible Work-Around

I've managed to work around this problem by breaking down the numpy array I want to element-wise reduce into chunks of size 200 (just an arbitrary number less than 505), and calling Reduce() on each chunk, then reassembling on the master process. However, this is somewhat slow.

My Questions:

  1. Does anyone know if this is indeed due to a 4kb buffer limit (or similar) in mpi4py/MPI?

  2. Is there a better solution than slicing the array into pieces and making many calls to Reduce(), as I am currently doing? This seems slow to run.


Some Examples

Below is code that illustrates

  1. the problem, and
  2. one possible solution, based on slicing the array into shorter pieces and doing lots of MPI Reduce() calls, rather than just one (controlled with the use_slices boolean)

With case=0 and use_slices=False, the error can be seen (array length 506).

With case=1 and use_slices=False, the error vanishes (array length 505).

With use_slices=True, the error vanishes, regardless of case, and even if case is set to a very long array (case=2).


Example Code

import mpi4py, mpi4py.MPI
import numpy as np

###### CASE FLAGS ########
# Whether or not to break the array into 200-element pieces
# before calling MPI Reduce()
use_slices = False

# The total length of the array to be reduced:
case = 0
if case == 0:
    array_length = 506
elif case == 1:
    array_length = 505
elif case == 2:
    array_length = 1000000

comm = mpi4py.MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()


array_to_reduce = np.ones(array_length) * (rank + 1)  # just some different numbers per rank
reduced_array = np.zeros(array_length)

if not use_slices:
    comm.Reduce(array_to_reduce,
                reduced_array,
                op = mpi4py.MPI.SUM,
                root = 0)

    if rank==0:
        print(reduced_array)
else:  # in this case, use_slices is True
    array_slice_length = 200
    sliced_array = np.array_split(array_to_reduce,
                                  range(array_slice_length, array_length, array_slice_length))

    reduced_array_using_slices = np.array([])
    for array_slice in sliced_array:
        returnedval = np.zeros(shape=array_slice.shape)
        comm.Reduce(array_slice,
                    returnedval,
                    op = mpi4py.MPI.SUM,
                    root = 0)
        reduced_array_using_slices = np.concatenate((reduced_array_using_slices, returnedval))
        comm.Barrier()

    if rank==0:
        print(reduced_array_using_slices)
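
To reproduce the behaviour, the script has to be launched with more than one rank, for example (the filename reduce_example.py is just a placeholder for wherever the code above is saved):

mpirun -n 2 python reduce_example.py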

Library Versions

Compiled from source: Open MPI 3.1.4, mpi4py 3.0.3

asked Oct 16 '22 by carthurs


1 Answer

This is not a problem with mpi4py per se. The issue comes from the Cross-Memory Attach (CMA) system calls process_vm_readv() and process_vm_writev(), which the shared-memory BTLs (Byte Transfer Layers, i.e., the components that move bytes between ranks) of Open MPI use to accelerate communication between ranks running on the same node: instead of copying the data twice to and from a shared-memory buffer, one process reads directly from the other's address space. This mechanism involves some setup overhead and is therefore only used for larger messages, which is why the problem only starts occurring once the message size crosses the eager threshold.
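
To connect this to the numbers in the question: with float64 data, 505 elements is 4040 bytes of payload and 506 elements is 4048 bytes, so the observed cut-off sits right at a ~4 KiB boundary once header overhead is counted. Below is a minimal sketch of that arithmetic (the ~4 KiB figure is an assumption about the default eager limit of the shared-memory BTL, not something read from mpi4py):

import numpy as np

# Assumption: the eager limit of Open MPI's shared-memory BTL defaults to
# roughly 4 KiB, and the per-fragment header also counts against it, so the
# usable payload is slightly below 4096 bytes.
assumed_eager_limit = 4096

for n_elements in (505, 506):
    payload_bytes = n_elements * np.dtype(np.float64).itemsize  # 8 bytes per element
    print(f"{n_elements} elements -> {payload_bytes} payload bytes "
          f"(assumed eager limit ~{assumed_eager_limit} bytes, including header)")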

CMA is part of the ptrace family of kernel services. Docker uses seccomp to limit what system calls can be made by processes running inside the container. The default profile has the following:

    {
        "names": [
            "kcmp",
            "process_vm_readv",
            "process_vm_writev",
            "ptrace"
        ],
        "action": "SCMP_ACT_ALLOW",
        "args": [],
        "comment": "",
        "includes": {
            "caps": [
                "CAP_SYS_PTRACE"
            ]
        },
        "excludes": {}
    },

limiting ptrace-related syscalls to containers that have the CAP_SYS_PTRACE capability, which is not among the capabilities granted by default. Therefore, to enable the normal functioning of Open MPI in Docker, one needs to grant the required capability by calling docker run with the following additional option:

--cap-add=SYS_PTRACE
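
A complete invocation might then look something like the following (the image name and the command run inside the container are placeholders, not taken from the question):

docker run --cap-add=SYS_PTRACE my-mpi-image mpirun -n 2 python reduce_example.py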

This will allow Open MPI to function properly, but enabling ptrace may present a security risk in certain container deployments. Therefore, an alternative is to disable the use of CMA by Open MPI. This is achieved by setting an MCA parameter depending on the version of Open MPI and the shared-memory BTL used:

  • for the sm BTL (default before Open MPI 1.8): --mca btl_sm_use_cma 0
  • for the vader BTL (default since Open MPI 1.8): --mca btl_vader_single_copy_mechanism none

Disabling the single-copy mechanism will force the BTL to use pipelined copy through a shared-memory buffer, which may or may not affect the run time of the MPI job.
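
For example, with the vader BTL the parameter could be passed on the mpirun command line, or exported as the corresponding OMPI_MCA_* environment variable before launching (the script name is a placeholder):

mpirun --mca btl_vader_single_copy_mechanism none -n 2 python reduce_example.py

export OMPI_MCA_btl_vader_single_copy_mechanism=none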

Read here about the shared-memory BTLs and the zero(single?)-copy mechanisms in Open MPI.

answered Oct 21 '22 by Hristo Iliev