I have been looking at the memory usage of some C++ REST API frameworks on Windows and Linux (Debian). In particular, I have looked at two frameworks: cpprestsdk and cpp-httplib. In both, a thread pool is created and used to service requests.
I took the thread pool implementation from cpp-httplib and put it in a minimal working example below, to show the memory usage that I am observing on Windows and Linux.
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <list>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using namespace std;

// TaskQueue and ThreadPool taken from https://github.com/yhirose/cpp-httplib
class TaskQueue {
public:
    TaskQueue() = default;
    virtual ~TaskQueue() = default;
    virtual void enqueue(std::function<void()> fn) = 0;
    virtual void shutdown() = 0;
    virtual void on_idle() {}
};

class ThreadPool : public TaskQueue {
public:
    explicit ThreadPool(size_t n) : shutdown_(false) {
        while (n) {
            threads_.emplace_back(worker(*this));
            cout << "Thread number " << threads_.size() << " has ID " << threads_.back().get_id() << endl;
            n--;
        }
    }

    ThreadPool(const ThreadPool&) = delete;
    ~ThreadPool() override = default;

    void enqueue(std::function<void()> fn) override {
        std::unique_lock<std::mutex> lock(mutex_);
        jobs_.push_back(fn);
        cond_.notify_one();
    }

    void shutdown() override {
        // Stop all worker threads...
        {
            std::unique_lock<std::mutex> lock(mutex_);
            shutdown_ = true;
        }
        cond_.notify_all();

        // Join...
        for (auto& t : threads_) {
            t.join();
        }
    }

private:
    struct worker {
        explicit worker(ThreadPool& pool) : pool_(pool) {}

        void operator()() {
            for (;;) {
                std::function<void()> fn;
                {
                    std::unique_lock<std::mutex> lock(pool_.mutex_);
                    pool_.cond_.wait(
                        lock, [&] { return !pool_.jobs_.empty() || pool_.shutdown_; });
                    if (pool_.shutdown_ && pool_.jobs_.empty()) { break; }
                    fn = pool_.jobs_.front();
                    pool_.jobs_.pop_front();
                }
                assert(true == static_cast<bool>(fn));
                fn();
            }
        }

        ThreadPool& pool_;
    };
    friend struct worker;

    std::vector<std::thread> threads_;
    std::list<std::function<void()>> jobs_;
    bool shutdown_;
    std::condition_variable cond_;
    std::mutex mutex_;
};

// MWE
class ContainerWrapper {
public:
    ~ContainerWrapper() {
        cout << "Destructor: data map is of size " << data.size() << endl;
    }
    map<pair<string, string>, double> data;
};

void handle_post() {
    cout << "Start adding data, thread ID: " << std::this_thread::get_id() << endl;
    ContainerWrapper cw;
    for (size_t i = 0; i < 5000; ++i) {
        string date = "2020-08-11";
        string id = "xxxxx_" + std::to_string(i);
        double value = 1.5;
        cw.data[make_pair(date, id)] = value;
    }
    cout << "Data map is now of size " << cw.data.size() << endl;
    unsigned pause = 3;
    cout << "Sleep for " << pause << " seconds." << endl;
    std::this_thread::sleep_for(std::chrono::seconds(pause));
}

int main(int argc, char* argv[]) {
    cout << "ID of main thread: " << std::this_thread::get_id() << endl;
    std::unique_ptr<TaskQueue> task_queue(new ThreadPool(40));
    for (size_t i = 0; i < 50; ++i) {
        cout << "Add task number: " << i + 1 << endl;
        task_queue->enqueue([]() { handle_post(); });
        // Sleep enough time for the task to finish.
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }
    task_queue->shutdown();
    return 0;
}
When I run this MWE and look at the memory consumption on Windows vs Linux, I get the graph below. For Windows, I used perfmon to get the Private Bytes value. On Linux, I used docker stats --no-stream --format "{{.MemUsage}}" to log the container's memory usage. This was in line with res for the process from top running inside the container. It appears from the graph that when a thread allocates memory for the map variable in the handle_post function on Windows, the memory is given back when the function exits, before the next call to the function. This was the type of behaviour that I was naively expecting. I have no experience of how the OS deals with memory allocated by a function that is being executed in a thread when the thread stays alive, i.e. as here in a thread pool. On Linux, it looks like the memory usage keeps growing and that memory is not given back when the function exits. When all 40 threads have been used, and there are 10 more tasks to process, the memory usage appears to stop growing. Can somebody give a high-level view of what is happening here on Linux from a memory management point of view, or even some pointers about where to look for background info on this specific topic?
Edit 1: I have edited the graph below to show the output value of rss from running ps -p <pid> -h -o etimes,pid,rss,vsz every second in the Linux container, where <pid> is the id of the process being tested. It is in reasonable agreement with the output of docker stats --no-stream --format "{{.MemUsage}}".
Edit 2: Based on a comment below regarding STL allocators, I removed the map from the MWE by replacing the handle_post function with the following and adding the includes #include <cstdlib> and #include <cstring>. Now, the handle_post function just allocates and sets memory for 500K ints, which is approximately 2 MiB.
void handle_post() {
    size_t chunk = 500000 * sizeof(int);
    if (int* p = (int*)malloc(chunk)) {
        memset(p, 1, chunk);
        cout << "Allocated and used " << chunk << " bytes, thread ID: " << this_thread::get_id() << endl;
        cout << "Memory address: " << p << endl;
        unsigned pause = 3;
        cout << "Sleep for " << pause << " seconds." << endl;
        this_thread::sleep_for(chrono::seconds(pause));
        free(p);
    }
}
I get the same behaviour here. I reduced the number of threads to 8 and the number of tasks to 10 in the example. The graph below shows the results.
Edit 3: I have added the results from running on a Linux CentOS machine. They broadly agree with the results from the Debian docker image.
Edit 4: Based on another comment below, I ran the example under valgrind's massif tool. The massif command-line parameters are in the images below. I ran it with --pages-as-heap=yes (second image below) and without this flag (first image below). The first image would suggest that ~2 MiB of memory is allocated on the (shared) heap as the handle_post function is executed on a thread and then freed as the function exits. This is what I would expect and what I observe on Windows. I am not sure how to interpret the graph with --pages-as-heap=yes yet, i.e. the second image.
I can't reconcile the output of massif in the first image with the value of rss from the ps command shown in the graphs above. If I run the Docker image and limit the container memory to 12MB using docker run --rm -it --privileged --memory="12m" --memory-swap="12m" --name=mwe_test cpp_testing:1.0, the container runs out of memory on the 7th allocation and is killed by the OS. I get Killed in the output and when I look at dmesg, I see Killed process 25709 (cpp_testing) total-vm:529960kB, anon-rss:10268kB, file-rss:2904kB, shmem-rss:0kB. This would suggest that the rss value from ps accurately reflects the (heap) memory actually being used by the process, whereas the massif tool calculates what it should be based on malloc/new and free/delete calls. This is just my basic assumption from this test. My question still stands, i.e. why is it, or why does it appear, that the heap memory is not being freed or deallocated when the handle_post function exits?
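As a further cross-check (a sketch of a glibc-specific option, not something run above): glibc exposes malloc_stats() in <malloc.h>, which prints per-arena totals to stderr, so the gap between what the allocator has obtained from the OS and what the program currently has in use can be seen directly.

#include <malloc.h>   // glibc-specific: malloc_stats()

int main() {
    // ... run the thread pool and tasks as in the MWE ...

    // Print per-arena allocator statistics to stderr. "system bytes" is what
    // each arena holds from the OS; "in use bytes" is what the program has
    // currently allocated. A large gap after the tasks have freed their data
    // means the allocator is retaining the memory rather than returning it.
    malloc_stats();
    return 0;
}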
Edit 5: I have added below a graph of the memory usage as the number of threads in the thread pool is increased from 1 to 4. The pattern continues as the number of threads is increased up to 10, so I have not included 5 to 10. Note that I have added a 5-second pause at the start of main, which is the initial flat line in the graph for the first ~5 seconds. It appears that, regardless of thread count, there is a release of memory after the first task is processed, but that memory is not released (kept for reuse?) after tasks 2 through 10. Could it be that some memory allocation parameter is tuned during the execution of task 1 (just thinking out loud)?
Edit 6: Based on the suggestion from the detailed answer below, I set the environment variable MALLOC_ARENA_MAX to 1 and to 2 before running the example. This gives the output in the following graph. This is as expected based on the explanation of the effect of this variable given in the answer.
Many modern allocators, including the one in glibc 2.17 that you are using, use multiple arenas (structures that track free memory regions) in order to avoid contention between threads that want to allocate at the same time.
Memory freed back to one arena isn't available to be allocated by another arena (unless some type of cross-arena transfer is triggered).
By default, glibc will allocate a new arena every time a new thread makes an allocation, until a predefined limit is hit (which defaults to 8 * the number of CPUs), as you can see by examining the code.
One consequence of this is that memory allocated and then freed on a thread may not be available to other threads, since they are using separate arenas, even if that thread isn't doing any useful work.
You can try setting the glibc malloc tunable glibc.malloc.arena_max to 1 in order to force all threads to the same arena and see if it changes the behavior you were observing.
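For completeness, here is a sketch of how that limit could be applied (assuming glibc; the glibc.malloc.arena_max tunable front end only exists in newer glibc releases, so on 2.17 the MALLOC_ARENA_MAX environment variable, or an early mallopt() call, sets the same limit).

#include <malloc.h>   // glibc-specific: mallopt(), M_ARENA_MAX

int main(int argc, char* argv[]) {
    // Cap the allocator at a single arena so all threads share it.
    // Equivalent to running the program with MALLOC_ARENA_MAX=1.
    // Must happen before the worker threads start allocating.
    mallopt(M_ARENA_MAX, 1);

    // ... create the ThreadPool and enqueue tasks as in the MWE ...
    return 0;
}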
Note that this has everything to do with the userspace allocator (in libc) and nothing to do with the OS allocation of memory: the OS is never informed that the memory has been freed. Even if you force a single arena, it doesn't mean that the userspace allocator will decide to inform the OS: it may simply keep the memory around to satisfy a future request (there are tunables to adjust this behavior also).
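Those tunables include M_TRIM_THRESHOLD and M_MMAP_THRESHOLD via mallopt(); glibc also provides malloc_trim() to ask explicitly for freed memory to be handed back. A sketch only (glibc-specific, the wrapper name is made up, and calling it after every task has a cost):

#include <malloc.h>   // glibc-specific: malloc_trim()

void handle_post();   // from the MWE above

// Hypothetical wrapper around the MWE's handle_post(): after the task's data
// has been freed, ask the allocator to return free pages to the OS. How much
// can actually be trimmed depends on the glibc version and heap fragmentation.
void handle_post_and_trim() {
    handle_post();
    malloc_trim(0);   // 0 = leave no extra padding at the top of the heap
}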
However, in your test using a single arena should be enough to prevent the constantly increasing memory footprint since the memory is freed before the next thread starts, and so we expect it to be reused by the next task, which starts on a different thread.
Finally, it is worth pointing out that what happens is highly dependent on exactly how threads are notified by the condition variable: presumably Linux uses a FIFO behavior, where the most recently queued (waiting) thread will be the last to be notified. This causes you to cycle through all the threads as you add tasks, causing many arenas to be created. A more efficient pattern (for a variety of reasons) is a LIFO policy: use the most recently enqueued thread for the next job. This would cause the same thread to be repeatedly reused in your test and "solve" the problem.
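To illustrate, here is a minimal sketch of such a LIFO dispatch policy (not the cpp-httplib implementation; the class name LifoPool is made up): idle workers are parked on a stack with one condition variable each, and enqueue() hands the job to the most recently idled worker, so a lightly loaded pool keeps reusing the same thread, and therefore the same arena.

#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal LIFO dispatch sketch: idle workers park on a stack, and enqueue()
// wakes the most recently idled worker, so the same thread keeps getting
// reused when the pool is lightly loaded.
class LifoPool {
public:
    explicit LifoPool(size_t n) {
        for (size_t i = 0; i < n; ++i) workers_.emplace_back(new Worker);
        for (size_t i = 0; i < n; ++i)
            threads_.emplace_back([this, i] { run(*workers_[i]); });
    }

    void enqueue(std::function<void()> fn) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!idle_.empty()) {
            Worker* w = idle_.back();          // most recently idled worker
            idle_.pop_back();
            w->job = std::move(fn);
            w->cv.notify_one();
        } else {
            backlog_.push(std::move(fn));      // all workers busy: queue it
        }
    }

    void shutdown() {
        {
            std::unique_lock<std::mutex> lock(mutex_);
            shutdown_ = true;
            for (Worker* w : idle_) w->cv.notify_one();
            idle_.clear();
        }
        for (auto& t : threads_) t.join();
    }

private:
    struct Worker {
        std::condition_variable cv;
        std::function<void()> job;
    };

    void run(Worker& self) {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            if (!backlog_.empty()) {           // drain queued work first
                self.job = std::move(backlog_.front());
                backlog_.pop();
            } else if (shutdown_) {
                return;
            } else {
                idle_.push_back(&self);        // park on top of the idle stack
                self.cv.wait(lock, [&] { return self.job || shutdown_; });
                if (!self.job) return;         // woken for shutdown only
            }
            std::function<void()> fn = std::move(self.job);
            self.job = nullptr;
            lock.unlock();
            fn();                              // run the task outside the lock
            lock.lock();
        }
    }

    std::mutex mutex_;
    std::vector<std::unique_ptr<Worker>> workers_;
    std::vector<std::thread> threads_;
    std::vector<Worker*> idle_;
    std::queue<std::function<void()>> backlog_;
    bool shutdown_ = false;
};

This deliberately omits a destructor and error handling; the only point is the dispatch order.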
Final note: many allocators, but not the one in the older version of glibc that you are using, also implement a per-thread cache which allows the allocation fast path to proceed without any atomic operations. This can produce a similar effect to the use of multiple arenas, and one which keeps scaling with the number of threads.