I've read the documentation about this parameter, but the difference is really huge! When it is enabled, the memory usage of a simple program (see below) is about 7 GB; when it is disabled, the reported usage is about 160 KB.
top also shows about 7 GB, which more or less confirms the result with pages-as-heap=yes.
(I have a theory, but I don't believe it would explain such a huge difference, so I'm asking for some help.)
What especially bothers me is that most of the reported memory usage is attributed to std::string, while "what?" is never printed (meaning the actual capacity is pretty small).
I do need to use pages-as-heap=yes while profiling my app; I just wonder how to avoid the "false positives".
The code snippet:
    #include <chrono>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    void run()
    {
        while (true)
        {
            // A fresh string each iteration; the appends force a few small reallocations.
            std::string s;
            s += "aaaaa";
            s += "aaaaaaaaaaaaaaa";
            s += "bbbbbbbbbb";
            s += "cccccccccccccccccccccccccccccccccccccccccccccccccc";
            if (s.capacity() > 1024) std::cout << "what?" << std::endl;
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }

    int main()
    {
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < 192; ++i) workers.push_back(std::thread(&run));
        workers.back().join();
    }
Compiled with: g++ --std=c++11 -fno-inline -g3 -pthread
With pages-as-heap=yes:
100.00% (7,257,714,688B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.75% (7,239,757,824B) 0x54E0679: mmap (mmap.c:34)
| ->53.63% (3,892,314,112B) 0x545C3CF: new_heap (arena.c:438)
| | ->53.63% (3,892,314,112B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->53.63% (3,892,314,112B) 0x5463248: malloc (malloc.c:2911)
| | ->53.63% (3,892,314,112B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF8E37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x401252: run() (test.cpp:11)
| | ->53.63% (3,892,314,112B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->53.63% (3,892,314,112B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->53.63% (3,892,314,112B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->53.63% (3,892,314,112B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->53.63% (3,892,314,112B) 0x54E63DB: clone (clone.S:109)
| |
| ->35.14% (2,550,136,832B) 0x545C35B: new_heap (arena.c:427)
| | ->35.14% (2,550,136,832B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->35.14% (2,550,136,832B) 0x5463248: malloc (malloc.c:2911)
| | ->35.14% (2,550,136,832B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF8E37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x401252: run() (test.cpp:11)
| | ->35.14% (2,550,136,832B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->35.14% (2,550,136,832B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->35.14% (2,550,136,832B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->35.14% (2,550,136,832B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->35.14% (2,550,136,832B) 0x54E63DB: clone (clone.S:109)
| |
| ->10.99% (797,306,880B) 0x51CA1D4: pthread_create@@GLIBC_2.2.5 (allocatestack.c:513)
| ->10.99% (797,306,880B) 0x4CE2DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->10.99% (797,306,880B) 0x4CE2ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->10.99% (797,306,880B) 0x401BEA: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->10.99% (797,306,880B) 0x401353: main (test.cpp:24)
|
->00.25% (17,956,864B) in 1+ places, all below ms_print's threshold (01.00%)
while with pages-as-heap=no:
96.38% (159,289B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->43.99% (72,704B) 0x4EBAEFE: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->43.99% (72,704B) 0x40106B8: call_init.part.0 (dl-init.c:72)
| ->43.99% (72,704B) 0x40107C9: _dl_init (dl-init.c:30)
| ->43.99% (72,704B) 0x4000C68: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
|
->33.46% (55,296B) 0x40138A3: _dl_allocate_tls (dl-tls.c:322)
| ->33.46% (55,296B) 0x53D126D: pthread_create@@GLIBC_2.2.5 (allocatestack.c:588)
| ->33.46% (55,296B) 0x4EE9DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->33.46% (55,296B) 0x4EE9ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->33.46% (55,296B) 0x401BEA: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->33.46% (55,296B) 0x401353: main (test.cpp:24)
|
->12.12% (20,025B) 0x4EFFE37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.07% (19,950B) 0x401285: run() (test.cpp:14)
| | ->12.07% (19,950B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->12.07% (19,950B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->12.07% (19,950B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->12.07% (19,950B) 0x4EE9C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->12.07% (19,950B) 0x53D06B8: start_thread (pthread_create.c:333)
| | ->12.07% (19,950B) 0x56ED3DB: clone (clone.S:109)
| |
| ->00.05% (75B) in 1+ places, all below ms_print's threshold (01.00%)
|
->05.58% (9,216B) 0x40315B: __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->05.58% (9,216B) 0x402FC2: std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> >&, unsigned long) (alloc_traits.h:488)
| ->05.58% (9,216B) 0x402D4B: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::thread::_Impl<std::_Bind_simple<void (*())()> >*, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr_base.h:616)
| ->05.58% (9,216B) 0x402BDE: std::__shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr_base.h:1090)
| ->05.58% (9,216B) 0x402A76: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > >::shared_ptr<std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr.h:316)
| ->05.58% (9,216B) 0x402771: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::allocate_shared<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr.h:594)
| ->05.58% (9,216B) 0x402325: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::make_shared<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::_Bind_simple<void (*())()> >(std::_Bind_simple<void (*())()>&&) (shared_ptr.h:610)
| ->05.58% (9,216B) 0x401F9C: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::thread::_M_make_routine<std::_Bind_simple<void (*())()> >(std::_Bind_simple<void (*())()>&&) (thread:196)
| ->05.58% (9,216B) 0x401BC4: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->05.58% (9,216B) 0x401353: main (test.cpp:24)
|
->01.24% (2,048B) 0x402C9A: __gnu_cxx::new_allocator<std::thread>::allocate(unsigned long, void const*) (new_allocator.h:104)
->01.24% (2,048B) 0x402AF5: std::allocator_traits<std::allocator<std::thread> >::allocate(std::allocator<std::thread>&, unsigned long) (alloc_traits.h:488)
->01.24% (2,048B) 0x402928: std::_Vector_base<std::thread, std::allocator<std::thread> >::_M_allocate(unsigned long) (stl_vector.h:170)
->01.24% (2,048B) 0x40244E: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<std::thread>(std::thread&&) (vector.tcc:412)
->01.24% (2,048B) 0x40206D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<std::thread>(std::thread&&) (vector.tcc:101)
->01.24% (2,048B) 0x401C82: std::vector<std::thread, std::allocator<std::thread> >::push_back(std::thread&&) (stl_vector.h:932)
->01.24% (2,048B) 0x401366: main (test.cpp:24)
Please ignore the crappy handling of the threads; it's just a very short example.
It appears that this is not related to std::string at all. As @Lawrence suggested, this can be reproduced by simply allocating a single int on the heap (with new). I believe @Lawrence is very close to the real answer here; quoting his comments (easier for future readers):
Lawrence:
@KirilKirov The string allocation is not actually taking that much space... Each thread gets its initial stack and then heap access maps some large amount of space (around 70m) that gets inaccurately reflected. You can measure it by just declaring 1 string and then having a spin loop... the same virtual memory usage is shown – Lawrence Sep 28 at 14:51
me:
@Lawrence - you're damn right! OK, so you're saying (and it appears to be like this) that on each thread, on the first heap allocation, the memory manager (or the OS, or whatever) dedicates a huge chunk of memory for the thread's heap needs? And this chunk will be reused later (or shrunk, if necessary)? – Kiril Kirov Sep 28 at 15:45
Lawrence:
@KirilKirov something of that nature... exact allocations probably depends on malloc implementation and whatnot – Lawrence 2 days ago
massif with --pages-as-heap=yes and the top column you are observing both measure the virtual memory used by a process. This virtual memory includes all space mmap'd in the implementation of malloc and during the creation of threads. For example, the default stack size for a thread will be 8192k, which is reflected in the creation of each thread and contributes to the virtual memory footprint.
The specific allocation scheme depends on the implementation, but it seems that the first heap allocation on a new thread will mmap roughly 65 megabytes of space. This can be viewed by looking at the pmap output for a process.
Excerpt from a very similar program to the example:
75170: ./a.out
0000000000400000 24K r-x-- a.out
0000000000605000 4K r---- a.out
0000000000606000 4K rw--- a.out
0000000001b6a000 200K rw--- [ anon ]
00007f669dfa4000 4K ----- [ anon ]
00007f669dfa5000 8192K rw--- [ anon ]
00007f669e7a5000 4K ----- [ anon ]
00007f669e7a6000 8192K rw--- [ anon ]
00007f669efa6000 4K ----- [ anon ]
00007f669efa7000 8192K rw--- [ anon ]
...
00007f66cb800000 8192K rw--- [ anon ]
00007f66cc000000 132K rw--- [ anon ]
00007f66cc021000 65404K ----- [ anon ]
00007f66d0000000 132K rw--- [ anon ]
00007f66d0021000 65404K ----- [ anon ]
00007f66d4000000 132K rw--- [ anon ]
00007f66d4021000 65404K ----- [ anon ]
...
00007f6880586000 8192K rw--- [ anon ]
00007f6880d86000 1056K r-x-- libm-2.23.so
00007f6880e8e000 2044K ----- libm-2.23.so
...
00007f6881c08000 4K r---- libpthread-2.23.so
00007f6881c09000 4K rw--- libpthread-2.23.so
00007f6881c0a000 16K rw--- [ anon ]
00007f6881c0e000 152K r-x-- ld-2.23.so
00007f6881e09000 24K rw--- [ anon ]
00007f6881e33000 4K r---- ld-2.23.so
00007f6881e34000 4K rw--- ld-2.23.so
00007f6881e35000 4K rw--- [ anon ]
00007ffe9d75b000 132K rw--- [ stack ]
00007ffe9d7f8000 12K r---- [ anon ]
00007ffe9d7fb000 8K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 7815008K
It seems that malloc becomes more conservative as you approach some threshold of virtual memory per process. Also, my comment about libraries being mapped separately was misguided (they should be shared per process).
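To make an excerpt like the one above easier to read, the mappings can be totalled by permission string. The following is only a minimal sketch: the function name kib_by_permissions is made up for illustration, and it assumes lines of the form "address sizeK perms ..." as in the excerpt, skipping anything else.

```cpp
#include <exception>
#include <map>
#include <sstream>
#include <string>

// Sums the size column (in KiB) of pmap-style lines, grouped by the
// permission column. Lines that do not look like "addr <n>K perms ..."
// (the header, "...", the "total" line) are skipped.
std::map<std::string, long> kib_by_permissions(const std::string& pmap_text) {
    std::map<std::string, long> totals;
    std::istringstream lines(pmap_text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string address, size, perms;
        if (!(fields >> address >> size >> perms)) continue;
        if (size.size() < 2 || size.back() != 'K') continue;
        try {
            totals[perms] += std::stol(size.substr(0, size.size() - 1));
        } catch (const std::exception&) {
            continue;  // size column was not a number
        }
    }
    return totals;
}
```

Fed the two lines that make up one arena in the excerpt (132K rw--- and 65404K -----), this gives 132 + 65404 = 65536 KiB, which is exactly 64 MiB. Only the 132 KiB part is readable and writable; the rest is address space reserved with no access permissions.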
I'll try to write a short summary of what I learned while trying to figure out what's happening.
Note: this answer is possible thanks to @Lawrence - appreciated!
This has absolutely nothing to do with Linux/kernel (virtual) memory management, nor with std::string.
It's all about glibc's memory allocator: it allocates huge areas of memory on the first (and not only the first, of course) dynamic allocation per thread.
MCVE
    #include <chrono>
    #include <cstdlib>   // rand
    #include <memory>    // std::make_unique
    #include <thread>
    #include <vector>

    int main() {
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < 192; ++i)
            workers.emplace_back([] {
                // A single small heap allocation per thread is enough to trigger the effect.
                const auto x = std::make_unique<int>(rand());
                while (true) std::this_thread::sleep_for(std::chrono::seconds(1));
            });
        workers.back().join();
    }
Please ignore the crappy handling of the threads; I wanted this to be as short as possible.
Compile: g++ --std=c++14 -fno-inline -g3 -O0 -pthread test.cpp
Profile: valgrind --tool=massif --pages-as-heap=[no|yes] ./a.out
top shows 7'815'012 KiB virtual memory.
pmap also shows 7'815'016 KiB virtual memory.
A similar result is shown by massif with pages-as-heap=yes: 7'817'088 KiB, see below.
On the other hand, massif with pages-as-heap=no is drastically different: around 133 KiB!
Memory usage before killing the program, with pages-as-heap=yes:
100.00% (8,004,698,112B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.78% (7,986,741,248B) 0x54E0679: mmap (mmap.c:34)
| ->46.11% (3,690,987,520B) 0x545C3CF: new_heap (arena.c:438)
| | ->46.11% (3,690,987,520B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->46.11% (3,690,987,520B) 0x5463248: malloc (malloc.c:2911)
| | ->46.11% (3,690,987,520B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->46.11% (3,690,987,520B) 0x4026D0: std::_MakeUniq<int>::__single_object std::make_unique<int, int>(int&&) (unique_ptr.h:765)
| | ->46.11% (3,690,987,520B) 0x400EC5: main::{lambda()
| | ->46.11% (3,690,987,520B) 0x40225C: void std::_Bind_simple<main::{lambda()
| | ->46.11% (3,690,987,520B) 0x402194: std::_Bind_simple<main::{lambda()
| | ->46.11% (3,690,987,520B) 0x402102: std::thread::_Impl<std::_Bind_simple<main::{lambda()
| | ->46.11% (3,690,987,520B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->46.11% (3,690,987,520B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->46.11% (3,690,987,520B) 0x54E63DB: clone (clone.S:109)
| |
| ->33.53% (2,684,354,560B) 0x545C35B: new_heap (arena.c:427)
| | ->33.53% (2,684,354,560B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->33.53% (2,684,354,560B) 0x5463248: malloc (malloc.c:2911)
| | ->33.53% (2,684,354,560B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->33.53% (2,684,354,560B) 0x4026D0: std::_MakeUniq<int>::__single_object std::make_unique<int, int>(int&&) (unique_ptr.h:765)
| | ->33.53% (2,684,354,560B) 0x400EC5: main::{lambda()
| | ->33.53% (2,684,354,560B) 0x40225C: void std::_Bind_simple<main::{lambda()
| | ->33.53% (2,684,354,560B) 0x402194: std::_Bind_simple<main::{lambda()
| | ->33.53% (2,684,354,560B) 0x402102: std::thread::_Impl<std::_Bind_simple<main::{lambda()
| | ->33.53% (2,684,354,560B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->33.53% (2,684,354,560B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->33.53% (2,684,354,560B) 0x54E63DB: clone (clone.S:109)
| |
| ->20.13% (1,611,399,168B) 0x51CA1D4: pthread_create@@GLIBC_2.2.5 (allocatestack.c:513)
| ->20.13% (1,611,399,168B) 0x4CE2DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->20.13% (1,611,399,168B) 0x4CE2ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->20.13% (1,611,399,168B) 0x40139A: std::thread::thread<main::{lambda()
| ->20.13% (1,611,399,168B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
| ->20.13% (1,611,399,168B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
| ->19.19% (1,535,864,832B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| | ->19.19% (1,535,864,832B) 0x400F47: main (test.cpp:10)
| |
| ->00.94% (75,534,336B) in 1+ places, all below ms_print's threshold (01.00%)
|
->00.22% (17,956,864B) in 1+ places, all below ms_print's threshold (01.00%)
Memory usage before killing the program, with pages-as-heap=no:
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
68 2,793,125 143,280 136,676 6,604 0
95.39% (136,676B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->50.74% (72,704B) 0x4EBAEFE: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->50.74% (72,704B) 0x40106B8: call_init.part.0 (dl-init.c:72)
| ->50.74% (72,704B) 0x40107C9: _dl_init (dl-init.c:30)
| ->50.74% (72,704B) 0x4000C68: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
|
->36.58% (52,416B) 0x40138A3: _dl_allocate_tls (dl-tls.c:322)
| ->36.58% (52,416B) 0x53D126D: pthread_create@@GLIBC_2.2.5 (allocatestack.c:588)
| ->36.58% (52,416B) 0x4EE9DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->36.58% (52,416B) 0x4EE9ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->36.58% (52,416B) 0x40139A: std::thread::thread<main::{lambda()
| ->36.58% (52,416B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
| ->36.58% (52,416B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
| ->34.77% (49,824B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| | ->34.77% (49,824B) 0x400F47: main (test.cpp:10)
| |
| ->01.81% (2,592B) 0x4010FF: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<main::{lambda()
| ->01.81% (2,592B) 0x40103D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| ->01.81% (2,592B) 0x400F47: main (test.cpp:10)
|
->06.13% (8,784B) 0x401B4B: __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x401A60: std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x40194D: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x401894: std::__shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x40183A: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x4017C7: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x4016AB: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x40155E: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x401374: std::thread::thread<main::{lambda()
| ->06.13% (8,784B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
| ->06.13% (8,784B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
| ->05.83% (8,352B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| | ->05.83% (8,352B) 0x400F47: main (test.cpp:10)
| |
| ->00.30% (432B) in 1+ places, all below ms_print's threshold (01.00%)
|
->01.43% (2,048B) 0x403432: __gnu_cxx::new_allocator<std::thread>::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->01.43% (2,048B) 0x4032CF: std::allocator_traits<std::allocator<std::thread> >::allocate(std::allocator<std::thread>&, unsigned long) (alloc_traits.h:488)
| ->01.43% (2,048B) 0x4030B8: std::_Vector_base<std::thread, std::allocator<std::thread> >::_M_allocate(unsigned long) (stl_vector.h:170)
| ->01.43% (2,048B) 0x4010B6: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<main::{lambda()
| ->01.43% (2,048B) 0x40103D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| ->01.43% (2,048B) 0x400F47: main (test.cpp:10)
|
->00.51% (724B) in 1+ places, all below ms_print's threshold (01.00%)
With pages-as-heap=no, things look reasonable, so let's not inspect it. As expected, everything ends up in malloc/new/new[] and the memory usage is small enough not to worry us; these are the high-level allocations.
But see pages-as-heap=yes? ~8 GB (about 7.45 GiB) of virtual memory with this simple code?
Let's inspect the stack traces.
pthread_create
Let's start with the easier one: the one that ends with pthread_create.
massif reports 1,611,399,168 bytes of allocated memory. This is exactly 192 * 8'196 KiB, i.e. 192 threads * roughly 8 MiB, which is the default maximum stack size of a thread on Linux.
Note that 8'196 KiB is not exactly 8 MiB (8'192 KiB). The extra 4 KiB per thread is most likely the stack guard page (the pmap excerpt above shows a 4K region with permissions ----- next to each 8192K stack), but it's not significant at the moment.
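The stack figure can be checked with a few lines of arithmetic. This is a sketch under assumptions: the 4 KiB guard region per thread is inferred from the numbers, not confirmed, and the helper names are made up for illustration. The second function reads the soft RLIMIT_STACK limit, which, according to the pthread_create man page, determines the default stack size of new threads under NPTL unless it is unlimited.

```cpp
#include <cstdint>
#include <sys/resource.h>

// Bytes reserved for thread stacks: each thread gets its stack plus a guard region.
constexpr std::uint64_t stack_reservation_bytes(std::uint64_t threads,
                                                std::uint64_t stack_kib,
                                                std::uint64_t guard_kib) {
    return threads * (stack_kib + guard_kib) * 1024;
}

// 192 threads, 8192 KiB stack, 4 KiB guard (the guard size is an assumption).
static_assert(stack_reservation_bytes(192, 8192, 4) == 1611399168ULL,
              "matches the pthread_create figure reported by massif");

// Soft RLIMIT_STACK in KiB, or 0 if it is unlimited or cannot be read.
std::uint64_t stack_soft_limit_kib() {
    struct rlimit limit{};
    if (getrlimit(RLIMIT_STACK, &limit) != 0 || limit.rlim_cur == RLIM_INFINITY) {
        return 0;
    }
    return static_cast<std::uint64_t>(limit.rlim_cur) / 1024;
}
```

On a system where ulimit -s prints 8192, stack_soft_limit_kib() would be expected to return 8192 as well, which is where the 8 MiB per thread comes from.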
std::make_unique<int>
OK, let's now see the other two stacks... wait, they are exactly the same? Almost: they differ only in the line inside new_heap (arena.c:438 vs arena.c:427). massif's documentation explains this; I didn't completely understand it, but it's also not significant. Let's combine the results and examine them together.
The memory usage from these two stacks combined is 6'375'342'080 bytes, and all of it is caused by our simple std::make_unique<int>!
Let's take a step back: if we run the same experiment with a single thread, we will see that this int allocation causes 67'108'864 bytes of memory to be allocated, which is exactly 64 MiB. What is going on?
It all comes down to the implementation of malloc (new/new[] is, by default, implemented internally with malloc).
malloc internally uses a memory allocator called ptmalloc2, the default thread-aware memory allocator in glibc on Linux.
Simply put, this allocator deals with the following terms:
per-thread arena: a huge area of memory, usually one per thread for performance reasons; not all software threads have their own arena, as this usually depends on the number of hardware threads (and other details, I guess);
heap: the arenas are divided into heaps;
chunks: the heaps are divided into smaller areas of memory, called chunks.
There are a lot of details about these things; I'll post some interesting links a bit later, although this should be enough for the reader to do their own research. These are really low-level and deep things, related to C++ memory management.
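If the arena reservations get in the way while profiling, one thing that may be worth trying is capping the number of arenas. glibc exposes this through mallopt with the M_ARENA_MAX parameter (there is also a MALLOC_ARENA_MAX environment variable). The sketch below is only an illustration: the helper name cap_malloc_arenas is hypothetical, and its effect on the massif numbers in this post has not been measured.

```cpp
#include <malloc.h>

// Asks glibc's allocator to use at most `max_arenas` arenas.
// Returns true if the request was accepted, false when not built against glibc.
bool cap_malloc_arenas(int max_arenas) {
#if defined(__GLIBC__) && defined(M_ARENA_MAX)
    return mallopt(M_ARENA_MAX, max_arenas) == 1;  // mallopt returns 1 on success
#else
    (void)max_arenas;  // no such tunable assumed on other C libraries
    return false;
#endif
}
```

It would be called at the very start of main, before any threads are created. The trade-off is that threads then share arenas, so allocation-heavy code may see more lock contention.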
So, let's go back to our test with a single thread: 64 MiB allocated for a single int?? Let's look at the stack trace again and concentrate on its end:
mmap (mmap.c:34)
new_heap (arena.c:438)
arena_get2.part.3 (arena.c:646)
malloc (malloc.c:2911)
Surprise, surprise: malloc calls arena_get2, which calls new_heap, which leads us to mmap (mmap and brk are the low-level system calls used for memory allocation on Linux). And this is reported to allocate exactly 64 MiB of memory.
OK, let's now go back to our original example with the 192 threads and our huge number 6'375'342'080: this is exactly 95 * 64 MiB!
Why exactly 95, I can't really say. I stopped digging, but the fact that the big number is divisible by 64 MiB was good enough for me.
You can dig a lot deeper, if necessary.
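The divisibility claim can be verified at compile time. A small sketch follows (the names kHeapBytes and whole_heaps are made up for illustration); the byte counts are taken from the two new_heap stacks in the massif output above.

```cpp
#include <cstdint>

// 64 MiB, the size reported for a single new_heap reservation.
constexpr std::uint64_t kHeapBytes = 64ULL * 1024 * 1024;

// Number of whole 64 MiB heaps in a byte count, or 0 if it is not an exact multiple.
constexpr std::uint64_t whole_heaps(std::uint64_t bytes) {
    return bytes % kHeapBytes == 0 ? bytes / kHeapBytes : 0;
}

static_assert(whole_heaps(3690987520ULL) == 55, "new_heap (arena.c:438)");
static_assert(whole_heaps(2684354560ULL) == 40, "new_heap (arena.c:427)");
static_assert(whole_heaps(6375342080ULL) == 95, "both stacks combined");
```

So the two stacks account for 55 and 40 reservations of 64 MiB each, 95 in total.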
Really cool explanatory article: Understanding glibc malloc, by sploitfun
A more formal/official documentation: The GNU allocator
A cool stack exchange question: How does glibc malloc works
Others:
If some of these links are broken at the moment of reading this post, it should be fairly easy to find similar articles. This topic is very popular, if you know what to look for and how.
I hope these observations give a good high-level description of the whole picture and also provide enough food for further research.
Feel free to comment / (suggest) edit / correct / extend / etc.