C++ uses twice the memory when moving elements from one deque to another

In my project, I use pybind11 to bind C++ code to Python. Recently I have had to deal with very large data sets (70 GB+) and encountered the need to split data from one std::deque between multiple std::deques. Since my dataset is so large, I expect the split not to incur much memory overhead. Therefore I went for a one-pop, one-push strategy, which in general should ensure that my requirements are met.
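The pop-and-push strategy in isolation does behave as intended; here is a minimal Python sketch of the idea (using collections.deque in place of the bound std::deque, names are mine for illustration):

```python
from collections import deque

# Move elements one at a time: pop from the back of the source and
# push onto the destination, so only one extra element is live at any moment.
src = deque(range(10))
dst = deque()
while src:
    dst.append(src.pop())

print(len(src), len(dst))  # → 0 10: the source drains as the destination grows
```

In principle the total number of live elements stays constant, which is why a near-flat memory profile is expected for the split.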

That is all in theory. In practice, my process got killed. So I struggled for the past two days and eventually came up with the following minimal example demonstrating the problem.

In short, the minimal example creates a bunch of data in a deque (~11 GB), returns it to Python, then calls back into C++ to move the elements. Simple as that. The moving part is done in an executor.

The interesting thing is that if I don't use the executor, memory usage is as expected; also, when limits on virtual memory are imposed with ulimit, the program really respects these limits and doesn't crash.
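For completeness, the same virtual-memory cap can be set from inside the Python process with resource.setrlimit; a sketch, assuming RLIMIT_AS is the limit ulimit -Sv controls (shell units are KB, the resource module takes bytes):

```python
import resource

# `ulimit -Sv 15000000` is ~15 GB in KB; RLIMIT_AS is expressed in bytes.
limit_bytes = 15000000 * 1024
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
# The new soft limit must not exceed the hard limit.
if hard == resource.RLIM_INFINITY or limit_bytes <= hard:
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
```

With the cap in place, allocations past the limit fail (e.g. std::bad_alloc / MemoryError) instead of the process being OOM-killed.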

test.py

from test import _test
import asyncio
import concurrent

async def test_main(loop, executor):
    numbers = _test.generate()
    # moved_numbers = _test.move(numbers) # This works!
    moved_numbers = await loop.run_in_executor(executor, _test.move, numbers) # This doesn't!

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    executor = concurrent.futures.ThreadPoolExecutor(1)

    task = loop.create_task(test_main(loop, executor))
    loop.run_until_complete(task)

    executor.shutdown()
    loop.close()

test.cpp

#include <deque>
#include <iostream>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

namespace py = pybind11;

PYBIND11_MAKE_OPAQUE(std::deque<uint64_t>);
PYBIND11_DECLARE_HOLDER_TYPE(T, std::shared_ptr<T>);

template<class T>
void py_bind_opaque_deque(py::module& m, const char* type_name) {
    py::class_<std::deque<T>, std::shared_ptr<std::deque<T>>>(m, type_name)
    .def(py::init<>())
    .def(py::init<size_t, T>());
}

PYBIND11_PLUGIN(_test) {
    namespace py = pybind11;
    pybind11::module m("_test");
    py_bind_opaque_deque<uint64_t>(m, "NumbersDequeue");

    // Generate ~11Gb of data.
    m.def("generate", []() {
        std::deque<uint64_t> numbers;
        for (uint64_t i = 0; i < 1500 * 1000000; ++i) {
            numbers.push_back(i);
        }
        return numbers;
    });

    // Move data from one deque to another.
    m.def("move", [](std::deque<uint64_t>& numbers) {
        std::deque<uint64_t> numbers_moved;

        while (!numbers.empty()) {
            numbers_moved.push_back(std::move(numbers.back()));
            numbers.pop_back();
        }
        std::cout << "Done!\n";
        return numbers_moved;
    });

    return m.ptr();
}

test/__init__.py

import warnings
warnings.simplefilter("default")

Compilation:

g++ -std=c++14 -O2 -march=native -fPIC -Iextern/pybind11 `python3.5-config --includes` `python3.5-config --ldflags` `python3.5-config --libs` -shared -o test/_test.so test.cpp

Observations:

  • When the moving part is not done by the executor, i.e. we just call moved_numbers = _test.move(numbers), all works as expected: memory usage shown by htop stays around 11 GB. Great!
  • When the moving part is done in the executor, the program takes double the memory and crashes.
  • When a limit on virtual memory is imposed (~15 GB), all works fine, which is probably the most interesting part:

    ulimit -Sv 15000000 && python3.5 test.py  # prints: Done!

  • When the limit is increased beyond my physical RAM (150 GB), the program gets killed:

    ulimit -Sv 150000000 && python3.5 test.py  # [1] 2573 killed python3.5 test.py

  • Using the deque method shrink_to_fit doesn't help (and nor should it).
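Instead of eyeballing htop, the peak resident set size can also be checked programmatically with resource.getrusage; a small sketch:

```python
import resource

# ru_maxrss is the peak resident set size of this process so far:
# kilobytes on Linux (note: bytes on macOS).
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS: %.1f MB" % (peak_kb / 1024.0))
```

Printing this before and after the move call makes the doubling easy to quantify in a script.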

Used software

Ubuntu 14.04
gcc version 5.4.1 20160904 (Ubuntu 5.4.1-2ubuntu1~14.04)
Python 3.5.2
pybind11 latest release - v1.8.1

Note

Please note that this example was made merely to demonstrate the problem. Both asyncio and pybind11 are necessary for the problem to occur.

Any ideas on what might be going on are most welcomed.

asked Jan 21 '26 by Jendas
1 Answer

The problem turned out to be caused by data being allocated in one thread and then deallocated in another. This happens because of malloc arenas in glibc (for reference, see this). It can be nicely demonstrated by doing:

executor1 = concurrent.futures.ThreadPoolExecutor(1)
executor2 = concurrent.futures.ThreadPoolExecutor(1)

numbers = await loop.run_in_executor(executor1, _test.generate)
moved_numbers = await loop.run_in_executor(executor2, _test.move, numbers)

which would take twice the memory allocated by _test.generate, and

executor = concurrent.futures.ThreadPoolExecutor(1)

numbers = await loop.run_in_executor(executor, _test.generate)
moved_numbers = await loop.run_in_executor(executor, _test.move, numbers)

which wouldn't.

This issue can be solved either by rewriting the code so it doesn't move the elements from one container to another (my case), or by setting the environment variable export MALLOC_ARENA_MAX=1, which limits the number of malloc arenas to 1. This, however, might have performance implications (there is a good reason for having multiple arenas).
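Besides the environment variable, glibc also exposes the arena cap at runtime through mallopt. A hedged sketch calling it from Python via ctypes (M_ARENA_MAX is -8 in glibc's malloc.h; this is glibc-specific and should run early, before worker threads start allocating):

```python
import ctypes
import ctypes.util

M_ARENA_MAX = -8  # value of M_ARENA_MAX in glibc <malloc.h>

# Load the C library of the running process (glibc on Linux).
libc = ctypes.CDLL(ctypes.util.find_library("c"))
ret = libc.mallopt(M_ARENA_MAX, 1)  # glibc returns 1 on success, 0 on failure
```

The environment-variable route is simpler and takes effect before any allocation, so it is usually preferable; the mallopt call is only an option when you cannot control the process environment.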

answered Jan 24 '26 by Jendas

