In my project, I use pybind11 to bind C++ code to Python. Recently I have had to deal with very large data sets (70GB+) and encountered the need to split data from one std::deque between multiple std::deques. Since my dataset is so large, I expect the split not to incur much memory overhead. Therefore I went for a one-pop, one-push strategy, which in general should ensure that my requirements are met.
That is all in theory. In practice, my process got killed. So I struggled for the past two days and eventually came up with the following minimal example demonstrating the problem.
Generally, the minimal example creates a bunch of data in a deque (~11GB), returns it to Python, then calls back into C++ to move the elements. Simple as that. The moving part is done in an executor.
The interesting thing is that if I don't use an executor, memory usage is as expected, and also, when limits on virtual memory are imposed by ulimit, the program really respects these limits and doesn't crash.
test.py
from test import _test
import asyncio
import concurrent.futures


async def test_main(loop, executor):
    numbers = _test.generate()
    # moved_numbers = _test.move(numbers)  # This works!
    moved_numbers = await loop.run_in_executor(executor, _test.move, numbers)  # This doesn't!


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    executor = concurrent.futures.ThreadPoolExecutor(1)
    task = loop.create_task(test_main(loop, executor))
    loop.run_until_complete(task)
    executor.shutdown()
    loop.close()
test.cpp
#include <deque>
#include <iostream>

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

namespace py = pybind11;

PYBIND11_MAKE_OPAQUE(std::deque<uint64_t>);
PYBIND11_DECLARE_HOLDER_TYPE(T, std::shared_ptr<T>);

template<class T>
void py_bind_opaque_deque(py::module& m, const char* type_name) {
    py::class_<std::deque<T>, std::shared_ptr<std::deque<T>>>(m, type_name)
        .def(py::init<>())
        .def(py::init<size_t, T>());
}

PYBIND11_PLUGIN(_test) {
    py::module m("_test");
    py_bind_opaque_deque<uint64_t>(m, "NumbersDequeue");

    // Generate ~11GB of data.
    m.def("generate", []() {
        std::deque<uint64_t> numbers;
        for (uint64_t i = 0; i < 1500 * 1000000; ++i) {
            numbers.push_back(i);
        }
        return numbers;
    });

    // Move data from one deque to another.
    m.def("move", [](std::deque<uint64_t>& numbers) {
        std::deque<uint64_t> numbers_moved;
        while (!numbers.empty()) {
            numbers_moved.push_back(std::move(numbers.back()));
            numbers.pop_back();
        }
        std::cout << "Done!\n";
        return numbers_moved;
    });

    return m.ptr();
}
test/__init__.py
import warnings
warnings.simplefilter("default")
Compilation:
g++ -std=c++14 -O2 -march=native -fPIC -Iextern/pybind11 `python3.5-config --includes` `python3.5-config --ldflags` `python3.5-config --libs` -shared -o test/_test.so test.cpp
Observations:
- Using moved_numbers = _test.move(numbers) directly, everything works as expected: memory usage shown by htop stays around 11GB. Great!

- When a limit on virtual memory is imposed (~15GB), everything also works fine, which is probably the most interesting part (the same limit can also be set from within Python, as sketched after this list):

  ulimit -Sv 15000000 && python3.5 test.py
  >> Done!

- When we increase the limit, the program gets killed (150GB > my RAM):

  ulimit -Sv 150000000 && python3.5 test.py
  >> [1] 2573 killed python3.5 test.py

- Using the deque method shrink_to_fit doesn't help (and nor should it).
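For completeness, the virtual memory cap can also be applied from inside test.py rather than via the shell; a minimal sketch using the standard resource module (note that setrlimit takes bytes, whereas ulimit -Sv takes kilobytes):

import resource

# Equivalent of `ulimit -Sv 15000000`: cap the process's virtual
# address space. setrlimit expects bytes, ulimit -Sv kilobytes.
limit_bytes = 15000000 * 1024
_, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))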
Software used:
- Ubuntu 14.04
- gcc 5.4.1 20160904 (Ubuntu 5.4.1-2ubuntu1~14.04)
- Python 3.5.2
- pybind11, latest release (v1.8.1)
Note
Please note that this example was made merely to demonstrate the problem. Both asyncio and pybind11 are necessary for the problem to occur.
Any ideas on what might be going on are most welcome.
The problem turned out to be caused by data being created in one thread and then deallocated in another. This is due to malloc arenas in glibc (for reference see this): memory freed in a thread other than the one that allocated it is returned to the allocating thread's arena rather than to the OS, while the moving thread allocates from its own fresh arena, so peak usage roughly doubles. It can be nicely demonstrated by doing:
executor1 = concurrent.futures.ThreadPoolExecutor(1)
executor2 = concurrent.futures.ThreadPoolExecutor(1)
numbers = await loop.run_in_executor(executor1, _test.generate)
moved_numbers = await loop.run_in_executor(executor2, _test.move, numbers)
which would take twice the memory allocated by _test.generate, and
executor = concurrent.futures.ThreadPoolExecutor(1)
numbers = await loop.run_in_executor(executor, _test.generate)
moved_numbers = await loop.run_in_executor(executor, _test.move, numbers)
which wouldn't.
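To see the doubling without watching htop, the two-executor variant can be instrumented to print the peak RSS after each phase. A minimal sketch, assuming the _test module from the question (peak_rss_gb is a helper added here; on Linux, ru_maxrss is reported in kilobytes):

import asyncio
import concurrent.futures
import resource

from test import _test


def peak_rss_gb():
    # On Linux, ru_maxrss is the peak resident set size in kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2


async def test_main(loop):
    executor1 = concurrent.futures.ThreadPoolExecutor(1)
    executor2 = concurrent.futures.ThreadPoolExecutor(1)
    numbers = await loop.run_in_executor(executor1, _test.generate)
    print("after generate: %.1f GB peak" % peak_rss_gb())
    # The memory freed by move() in executor2's thread goes back to
    # executor1's arena instead of the OS, so the peak roughly doubles.
    moved_numbers = await loop.run_in_executor(executor2, _test.move, numbers)
    print("after move: %.1f GB peak" % peak_rss_gb())


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(test_main(loop))
    loop.close()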
This issue can be solved either by rewriting the code so it doesn't move the elements from one container to another (my case), or by setting the environment variable export MALLOC_ARENA_MAX=1, which limits the number of malloc arenas to 1. This might have performance implications, however (there is a good reason for having multiple arenas).
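Note that glibc reads MALLOC_ARENA_MAX once, when the allocator initializes, so setting it via os.environ inside an already-running Python process has no effect; it has to be exported in the shell or placed in the environment of a child process. A hypothetical launcher sketch:

import os
import subprocess
import sys

# Run test.py in a child process whose glibc allocator is limited
# to a single malloc arena.
env = dict(os.environ)
env["MALLOC_ARENA_MAX"] = "1"
subprocess.check_call([sys.executable, "test.py"], env=env)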