I have a process (restarted by a watchdog whenever it stops for any reason) that normally uses about 200 MB of memory. Once I saw it eating up memory, with usage around 1.5-2 GB, which definitely means a "memory leak" somewhere ("memory leak" in quotes, since it is not a real leak in the sense of allocated, never-freed, unreachable memory; note that only smart pointers are used). So I suspect some huge container (which I haven't found) or something similar.
Later, the process crashed because of the high memory usage, and a core dump of about 2 GB was generated. But the problem is that I can't reproduce the issue, so valgrind won't help here (I guess). It happens very rarely and I can't "catch" it.
So my question is: is there a way, using the executable and the core file, to locate which part of the process used most of the memory?
I took a look at the core file with gdb, and there's nothing unusual. But the core is big, so there must be something. Is there a clever way to understand what happened, or can only guessing help (and guessing is hard for such a big executable: 12 threads, roughly 50-100 classes, maybe more, etc.)?
It's a C++ application, running on RHEL5U3.
A core file is created in the current directory when various errors occur. Errors such as memory-address violations, illegal instructions, bus errors, and user-generated quit signals commonly cause this core dump. The core file contains a memory image of the terminated process.
With a core file, we can use the debugger (GDB) to inspect the state of the process at the moment it was terminated and to identify the code that caused the problem. Note that core dumps are often disabled by default (check `ulimit -c`), so this is a situation where a core file could be produced but may not be.
Open this coredump in hexadecimal format (as bytes/words/dwords/qwords). Starting from the middle of the file, try to notice any repeating pattern. If you find one, try to determine the starting address and length of a possible data structure. Using the length and contents of this structure, try to guess what it might be. Using the address, try to find some pointer to this structure. Repeat until you come to either the stack or some global variable. In the case of a stack variable, you'll easily know in which function this chain starts. In the case of a global variable, you'll at least know its type.
If you cannot find any pattern in the coredump, chances are that the leaking structure is very big. Just compare what you see in the file with the possible contents of all large structures in the program.
Update
If your coredump has a valid call stack, you can start by inspecting its functions. Search for anything unusual. Check whether memory allocations near the top of the call stack request too much. Check for possible infinite loops in the call-stack functions.
The words "only smart pointers are used" frighten me. If a significant part of these smart pointers are shared pointers (shared_ptr, intrusive_ptr, ...), then instead of searching for huge containers, it is worth searching for shared-pointer cycles.
Update 2
Try to determine where your heap ends in the corefile (the brk value). Run the coredumped process under gdb and use the pmap command (from another terminal). gdb should also know this value, but I have no idea how to ask it... If most of the process's memory is above brk, you can limit your search to large memory allocations (most likely, std::vector).
To improve the chances of finding leaks in the heap area of the existing coredump, some coding may be used (I haven't done it myself, just a theory):
A coredump file is in ELF format. Only the start and size of the data segment are needed from its header. To simplify the process, just read the segment as a linear file, ignoring structure.
Once I saw it's eating up the memory - with memory usage about 1.5-2GB
Quite often this would be the end result of an error loop going astray. Something like:
size_t size = 1;
char *p = malloc(size);
while (!enough_space(size)) {
    size *= 2;
    p = realloc(p, size);
}
// now use p to do whatever
If enough_space() erroneously returns false under some conditions, your process will quickly grow to consume all available memory.
only smart pointers are used
Unless you control all the code linked into the process, the above statement is false. The error loop could be inside libc, or any other library that you don't own.
only guessing may help
That's pretty much it. Evgeny's answer has good starting points to help you guess.