Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast capture stack trace on windows / 64-bit / mixed mode

Like most of you probably know there exists plenty of different mechanisms to walk stack traces, starting from windows api, and continuing further into depth of magical assembly world - let me list here some of links I've already studied.

For of all let me mention that I want to have mechanism of memory leak analysis of mixed mode (managed and unmanaged) / 64-bit + AnyCPU application and from all windows api's CaptureStackBackTrace is most suitable for my needs, but as I have analyzed - it does not support managed code stack walking. But that function API is closest to what I need (Since it also calculates back trace hash - unique identifier of particular call stack).

I've excluded different approaches of locating memory leaks - most of software which I have tried is either crashing or does not work unreliably, or gives bad results.

Also I don't want to recompile existing software and override malloc / new other mechanism - because it's heavy task (we have huge code base with a lot of dll's). Also I suspect this is not one time job which I need to perform - issue comes back with 1-2 year cycle, depending of who and what was coding, so I would prefer to have built-in memory leaks detection in application itself (memory api hooking) instead of fighting with this problem over and over again.

http://www.codeproject.com/Articles/11132/Walking-the-callstack

Uses StackWalk64 windows API function, but does not work with managed code. Also 64 bit support is not fully clear to me - I have seen some walkaround for 64 bit problem - I suspect that this code does not fully work when stack walk is done within same thread.

Then there exists process hacker: http://processhacker.sourceforge.net/

Which also use StackWalk64, but extends it's call back function (7th and 8-th parameters) to support mixed mode stack walking. After a lot of complexities with 7/8 call back functions I've managed to reach also support of StackWalk64 with mixed mode support (Catching stack trace as vector - where each pointer refers to assembly / dll location where call went by). But as you may guess - the performance of StackWalk64 is insufficient for my needs - even with simple Message box from C# side, application simply "hangs" for a while until it starts up correctly.

I haven't seen such heavy delays with CaptureStackBackTrace function call, so I assume that performance of StackWalk64 is insufficient for my needs.

There exists also COM based approach of stack trace determination - like this: http://www.codeproject.com/Articles/371137/A-Mixed-Mode-Stackwalk-with-the-IDebugClient-Inter

http://blog.steveniemitz.com/building-a-mixed-mode-stack-walker-part-1/

but what I'm afraid of - it requires COM , and thread needs to be com initialized, and due to memory api hooking I should not touch com state within any thread, because it can results in heavier problems (e.g. incorrect apartment initialization, other glitches)

Now I've reached the point when windows API becomes insufficient for my own needs, and I need to walk through call stack manually. Such examples can be found for example:

http://www.codeproject.com/Articles/11221/Easy-Detection-of-Memory-Leaks See function FillStackInfo / 32 bit only, does not support managed code.

There are couple of mentions about reversing stack trace - for example on following links:

  1. http://blog.airesoft.co.uk/2009/02/grabbing-kernel-thread-contexts-the-process-explorer-way/
  2. http://cbloomrants.blogspot.fi/2009/01/01-30-09-stack-tracing-on-windows.html
  3. http://www.gamedev.net/topic/364861-stack-dump-on-win32-how-to-get-api-addresses/
  4. http://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64/

Especially 1, 3, 4 links give some interesting night reading. :-)

But even thus they are rather interesting mechanisms, there is no fully working demo example on any of them.

I guess one of examples is Wine's dbghelp implementation (Windows "emulator" for linux), which also shows how exactly StackWalk64 works at the end, but I suspect that it's heavily bound to DWARF2 file format executable, so it's not identical to current windows PE executable file format.

Can someone pintpoint me to good implementation of stack walking, working on 64-bit architecture, with mixed mode support (can track native and managed memory allocations), which is bound purely in register / call stack / code analysis. (Combined implementations of 1, 3, 4)

Does someone has any good contact from Microsoft development team, who could potentially answer to this question ?

like image 386
TarmoPikaro Avatar asked Dec 28 '15 22:12

TarmoPikaro


2 Answers

x64 Stack walking is complicated as you have already found out. A simple alternative is to simply not do it but leave the hard things to the OS ETW stackwalker. That works and it is much faster than you will ever get.

You can take advantage of it by emitting your own ETW event. Before that you need to start an ETW session for your event provider and enable stack walking for your provider. There is a catch on Windows 7 where it does not work unless the managed stack frames are all NGenned because the x64 ETW Stackwalker will stop if he finds a stack frame which is not in any loaded module which is true for JITed code.

Starting with Windows 8 the ETW Stackwalker will always walk the first MB of the stack for stack frames which fixes the JIT problem. The JIT compiler emits Unwind Infos for the generated code if ETW tracing is on and registers it via RtlAddGrowableFunctionTable which makes it possible to walk the stack fast from within the kernel in the first place. Things work differently when ETW tracing is not enabled for compatibility reasons.

If you are after malloc/free new/delete memory leaks you can also use the OS bultin capabilities of Heap allocation tracing which already exists since Windows 7. See xperf -help start and https://randomascii.wordpress.com/2015/04/27/etw-heap-tracingevery-allocation-recorded/ for more information about heap allocation tracing. You can enable it for an already running process with no problems. The downside is that for any real world application the generated data is huge. But if you are after big allocations only then it can help to track only VirtualAlloc calls which can also be enabled with minimal overhead.

Managed code since .NET 4.5 has also its own ETW allocation tracing provider with full stack walking even on x64 Windows 7 because it does a full managed stackwalk by itself. More infos can be found in the CoreClr Sources at: ETW::SamplingLog::SendStackTrace in https://github.com/dotnet/coreclr/blob/master/src/inc/eventtracebase.h for many more details.

That is only a rough outline what is possible. To really get all of the necessary details would take a whole book I fear. And I am still learning new things every day.

Here is a heapalloc.cmd script which you can use to track heap allocations. By default it records into a 500MB ring buffer if your leak builds up over longer periods of time recording all allocation stacks without condensing them at runtime will not work out with WPA. But you could post process a huge ETL file and write your own viewer for it.

@echo off 
setlocal enabledelayedexpansion
REM consider using a different drive for ETL output to prevent slowing down 
REM your application and to prevent lost buffers
set OUTDIR=C:\TEMP
set OUTFILENAME=HeapTracing.etl
REM Final output file
set OUTFILE=!OUTDIR!\!OUTFILENAME!
set CLRUNDOWNFILE=!OUTDIR!\clr_HeapDCend.etl
set KERNELFILE=!OUTDIR!\kernel.etl
set CLRSESSIONFILE=!OUTDIR!\clrHeapSession.etl
set HEAPUSERFILE=!OUTDIR!\HeapUserSession.etl
REM Default is allocation and realloc to track memory leaks
REM HeapFree is the other option to track double free calls
set HEAPTRACINGFLAGS=HeapAlloc+HeapRealloc 

if "%3" NEQ "" (
echo Overriding Heap Tracing Flags with: %3
set HEAPTRACINGFLAGS=%3
)


if "%1" EQU "-start" ( 
    call :StartTracing -PidNewProcess %2
    goto :Exit 
) 

if "%1" EQU "-attachPid" ( 
    call :StartTracing -Pids %2
    goto :Exit 
) 

if "%1" EQU "-startNext" (
    reg add "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\%~nx2" /v TracingFlags /t REG_DWORD /d 1 /f
    if not %errorlevel% == 0 goto failure
    call :StartTracing -Pids 0
    goto :Exit
)

if "%1" EQU "-stop" ( 
    set XPERF_CreateNGenPdbs=1
    xperf -start ClrRundownSession -on e13c0d23-ccbc-4e12-931b-d9cc2eee27e4:0x118:5+a669021c-c450-4609-a035-5af59af4df18:0x118:5 -f "!CLRUNDOWNFILE!" -buffersize 256 -minbuffers 256 -maxbuffers 512 
    call :WaitUntilRundownCompleted "!CLRUNDOWNFILE!"
    xperf -stop -stop ClrSession ClrRundownSession HeapSession | findstr /V identifiable 2> NUL

    echo Merging profiles
    REM Reset symbol path to create the pdbs files in the output directory with in the directory with the same name like our etl file
    set TMPSYMBOLPATH=!_NT_SYMBOL_PATH!
    REM Each tool is using a different pdb cache folder. If you are using them side by side 
    REM you have to wait a long time to refresh the pdb cache. To spare the waiting time we use 
    REM the pdb cache folder from WPR

    mkdir C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache 2> NUL
    set _NT_SYMBOL_PATH=srv*C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache 
    mklink /D "!OUTFILE!.NGENPDB" C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache  2> NUL

    echo Managed PDBs are stored at: !OUTFILE!.NGENPDB. If you want to transfer the etl do not forget to copy this directory with the pdbs as well. 
    echo Merging ETL files and generating native pdbs

    xperf -merge  "!KERNELFILE!" "!CLRSESSIONFILE!" "!CLRUNDOWNFILE!" "!HEAPUSERFILE!" "!OUTFILE!"
    set _NT_SYMBOL_PATH=!TMPSYMBOLPATH!
    echo !OUTFILE! was created

    if "%2" NEQ "" reg delete "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\%~nx2" /v TracingFlags /f 2> NUL
    goto :Exit 
) 

goto Usage:

:StartTracing
xperf -start ClrSession -on  Microsoft-Windows-DotNETRuntime:5 -f "!CLRSESSIONFILE!" -buffersize 128 -minbuffers 256 -maxbuffers 512 
xperf -on PROC_THREAD+LOADER+latency+virt_alloc -stackwalk VirtualAlloc  -f "%KERNELFILE%"
xperf -start HeapSession -heap %1 %2 -BufferSize 1024 -MinBuffers 128 -MaxBuffers 1024 -stackwalk %HEAPTRACINGFLAGS% -f "!HEAPUSERFILE!" -FileMode Circular -MaxFile 1024
exit /B

REM Wait until writing to ETL file has stopped by checking its file size
:WaitUntilRundownCompleted
:StillWriting
    for %%F in (%1) do set "size=%%~zF"
    timeout /T 1  > nul
    for %%F in (%1) do set "size2=%%~zF"
    if "!size!" EQU "" goto :EndWriting
    if "!size!" NEQ "!size2!" goto StillWriting
:EndWriting
timeout /T 1  > nul
exit /B


:Usage
    echo Usage: 
    echo HeapAlloc.cmd -start [executable] or -stop
    echo               -start [executable] Start a trace session 
    echo               -startNext [executable] Start heap tracing for all subsequent calls to executable.
    echo               -attachPid ddd Start a trace session for specified process
    echo               -stop  [executable] Stop a trace session 
    echo Examples
    echo     HeapAlloc.cmd -startNext devenv.exe
    echo     HeapAlloc.cmd -stop      devenv.exe
    echo To attach to a running process
    echo     HeapAlloc.cmd -attachPid dddd
    echo     HeapAlloc.cmd -stop 
    echo You must call -stop for your executable if you have used -start or startNext because heap allocation tracing will enabled until you stop it!
goto :Exit 

:failure
    echo Error occured
goto :Exit

:Exit
like image 128
Alois Kraus Avatar answered Oct 13 '22 00:10

Alois Kraus


25.1.2016 Writing as separate issue, as complementary information.

For stack unique id CaptureStackBackTrace uses simple sum of all instruction pointers - idea is borrowed from: "Windows_Research_Kernel(sources)\WRK-v1.2\base\ntos\rtl\amd64\stkwalk.c":

    size_t hashValue = 0;

    for (int i = 0; i < nFrames; i++)
        hashValue += PtrToUlong(BackTrace[i]);

    *pBackTraceHash = (DWORD)hashValue;

I'm not sure about last conversion - some specify last parameter as DWORD, some as ulong64, but it's not relevant. The main problem with this calculation is that it's not unique enough. For case of recursive function calls - if you have call order:

func1
func2
func3

Stack trace for:

func1
func3
func2

Will be identical.

What I have debugged - for memory leak detection I'm getting 62876 false hits - unique stack id calculation is not reliable enough.

I've bit switched formula to:

static DWORD crc32_tab[] =
{
    0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
    0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
    0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
    0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
    0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,
    0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
    0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,
    0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
    0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
    0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
    0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,
    0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
    0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
    0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
    0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
    0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
    0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
    0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
    0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
    0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
    0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
    0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
    0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
    0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
    0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
    0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
    0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
    0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
    0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
    0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
    0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
    0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
    0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
    0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
    0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
    0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
    0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
    0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
    0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
    0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
    0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
    0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
    0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};

if (pBackTraceHash)
{

    size_t hashValue = 0;
    for( int idxFrame = 0; idxFrame < (int)iFrame; idxFrame++ )
    {
        unsigned char* p = (unsigned char*)&BackTrace[idxFrame];
        for( int i = 0; i < sizeof(void*); i++ )
            hashValue = crc32_tab[ ((hashValue ^ *p++) & 0xFF) ] ^ (hashValue >> 8);
    }
    *pBackTraceHash = (DWORD)hashValue;
}

This algorithm does not give false hits, but slows down execution a little bit.

Also memory leak statistics differs: Unreliable algorithm: Total amount of leaked memory: 48'874'764 / in 371 allocation pools Crc32 based algorithm: Total amount of leaked memory: 48'874'764 / in 614 allocation pools

Like you see - statistics combines (pools) similar call stack together - less fragmentation, but original call stack is lost. (Incorrect statistics)

May be someone could give me some faster algorithm for this ?

like image 34
TarmoPikaro Avatar answered Oct 12 '22 23:10

TarmoPikaro