
DotNet Core 2.1 hoarding memory in Linux

I have a websocket server that hoards memory over several days, to the point that Kubernetes eventually kills it. We monitor it using prometheus-net.

# dotnet --info

Host (useful for support):
  Version: 2.1.6
  Commit:  3f4f8eebd8

.NET Core SDKs installed:
  No SDKs were found.

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.6 [/usr/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.6 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.6 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

But when I connect remotely and take a memory dump (using createdump), the memory suddenly drops, without the service stopping, restarting, or losing any connected user. See the green line in the picture.

I can see in the graphs that the GC is collecting regularly in all generations.

GC Server is disabled using:

<PropertyGroup>
  <ServerGarbageCollection>false</ServerGarbageCollection>
</PropertyGroup>

Before disabling GC Server, memory used to grow much faster. Now it takes two weeks to reach 512 MB.

Other services using ASP.NET Core in a request/response fashion do not show this problem. This one uses WebSockets, where each connection usually lasts around 10 minutes, so I guess everything related to the connection easily survives into Gen 2.

[Image: memory usage graph] Note that there are two pods showing the same behaviour, and that one (the green line) suddenly drops in memory usage when the memory dump is taken.


The pods did not restart while the memory dump was being taken: [image]

No connection was lost or restarted.

Heap:

(lldb) eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x00007F8481C8D0B0
generation 1 starts at 0x00007F8481C7E820
generation 2 starts at 0x00007F852A1D7000
ephemeral segment allocation context: none
         segment             begin         allocated              size
00007F852A1D6000  00007F852A1D7000  00007F853A1D5E90  0xfffee90(268430992)
00007F84807D0000  00007F84807D1000  00007F8482278000  0x1aa7000(27947008)
Large object heap starts at 0x00007F853A1D7000
         segment             begin         allocated              size
00007F853A1D6000  00007F853A1D7000  00007F853A7C60F8  0x5ef0f8(6222072)
Total Size:              Size: 0x12094f88 (302600072) bytes.
------------------------------
GC Heap Size:            Size: 0x12094f88 (302600072) bytes.
(lldb)

Free objects:

(lldb) dumpheap -type Free -stat
Statistics:
              MT    Count    TotalSize Class Name
00000000010c52b0   219774     10740482      Free
Total 219774 objects
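
As a quick sanity check on the two outputs above, the segment sizes can be added up and compared with the reported heap size and the space held by Free objects. This is only a sketch in Python: the numbers are copied from the lldb output, while the variable names and the reading of the result are my own assumptions, not something SOS reports.

```python
# Sizes in bytes, copied from the `eeheap -gc` and
# `dumpheap -type Free -stat` output above.
soh_segments = [268430992, 27947008]  # small object heap segments
loh_segments = [6222072]              # large object heap segment
reported_total = 0x12094F88           # "GC Heap Size" line
free_bytes = 10740482                 # TotalSize of Free objects

computed_total = sum(soh_segments) + sum(loh_segments)
free_fraction = free_bytes / computed_total

print(f"computed total: {computed_total} bytes ({computed_total / 2**20:.1f} MiB)")
print(f"matches reported total: {computed_total == reported_total}")
print(f"free space: {free_fraction:.1%} of the GC heap")
```

If the arithmetic holds, Free objects make up only a few percent of the heap in this dump, which would suggest that most of the space is taken by objects that are still referenced rather than by fragmentation. The dump was taken after the drop in the graph, so this says nothing certain about the state before it.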

Is there any explanation for this behaviour?

Vlad asked Dec 20 '18 12:12


1 Answer

The problem was the connection to RabbitMQ. Because we were using short-lived channels, the "auto-reconnect" feature of RabbitMQ.Client was keeping a lot of state about dead channels. We switched this feature off, since we do not need its "perks", and everything started working normally. It was a pain: we basically had to set up a Windows deployment and do the usual memory analysis with Windows tools (JetBrains dotMemory in this case). Using lldb is not productive at all.

Vlad answered Oct 05 '22 03:10