I have an application written in Go that does message processing. It needs to pick up messages from the network (UDP) at a rate of 20K/second (potentially more), and each message can be up to the UDP packet's maximum length (64KB minus header size). The program needs to decode each incoming packet, encode it into another format, and send it to another network.
Right now, on a 24-core, 64GB RAM machine, it runs OK, but occasionally loses some packets. The program already follows the pipeline pattern using multiple goroutines/channels, and it takes only 10% of the whole machine's CPU load, so it has the potential to use more CPU% or RAM to handle all 20K/s messages without losing a single one. So I started profiling, and in the CPU profile runtime.mallocgc
appears at the top; that is the garbage collector runtime. I suspect the GC could be the culprit: it pauses for a few milliseconds (or some microseconds) and packets are lost. Some best practices say switching to sync.Pool might help, but my switch to the pool seemingly results in more CPU contention, loses even more packets, and more often:
(pprof) top20 -cum (sync|runtime)
245.99s of 458.81s total (53.61%)
Dropped 487 nodes (cum <= 22.94s)
Showing top 20 nodes out of 22 (cum >= 30.46s)
flat flat% sum% cum cum%
0 0% 0% 440.88s 96.09% runtime.goexit
1.91s 0.42% 1.75% 244.87s 53.37% sync.(*Pool).Get
64.42s 14.04% 15.79% 221.57s 48.29% sync.(*Pool).getSlow
94.29s 20.55% 36.56% 125.53s 27.36% sync.(*Mutex).Lock
1.62s 0.35% 36.91% 72.85s 15.88% runtime.systemstack
22.43s 4.89% 41.80% 60.81s 13.25% runtime.mallocgc
22.88s 4.99% 46.79% 51.75s 11.28% runtime.scanobject
1.78s 0.39% 47.17% 49.15s 10.71% runtime.newobject
26.72s 5.82% 53.00% 39.09s 8.52% sync.(*Mutex).Unlock
0.76s 0.17% 53.16% 33.74s 7.35% runtime.gcDrain
0 0% 53.16% 33.70s 7.35% runtime.gcBgMarkWorker
0 0% 53.16% 33.69s 7.34% runtime.gcBgMarkWorker.func2
My use of the pool is the standard pattern:
// create this one globally at program init
var rfpool = &sync.Pool{New: func() interface{} { return new(aPrivateStruct) }}

// get
rf := rfpool.Get().(*aPrivateStruct)

// put it back after processing this message
rfpool.Put(rf)
Am I doing something wrong? Or are there other ways to tune the GC to use less CPU%? The Go version is 1.8.
The listing below shows a lot of lock contention happening in pool.getSlow (source: pool.go at golang.org):
(pprof) list sync.*.getSlow
Total: 7.65mins
ROUTINE ======================== sync.(*Pool).getSlow in /opt/go1.8/src/sync/pool.go
1.07mins 3.69mins (flat, cum) 48.29% of Total
. . 144: x = p.New()
. . 145: }
. . 146: return x
. . 147:}
. . 148:
80ms 80ms 149:func (p *Pool) getSlow() (x interface{}) {
. . 150: // See the comment in pin regarding ordering of the loads.
30ms 30ms 151: size := atomic.LoadUintptr(&p.localSize) // load-acquire
180ms 180ms 152: local := p.local // load-consume
. . 153: // Try to steal one element from other procs.
30ms 130ms 154: pid := runtime_procPin()
20ms 20ms 155: runtime_procUnpin()
730ms 730ms 156: for i := 0; i < int(size); i++ {
51.55s 51.55s 157: l := indexLocal(local, (pid+i+1)%int(size))
580ms 2.01mins 158: l.Lock()
10.65s 10.65s 159: last := len(l.shared) - 1
40ms 40ms 160: if last >= 0 {
. . 161: x = l.shared[last]
. . 162: l.shared = l.shared[:last]
. 10ms 163: l.Unlock()
. . 164: break
. . 165: }
490ms 37.59s 166: l.Unlock()
. . 167: }
40ms 40ms 168: return x
. . 169:}
. . 170:
. . 171:// pin pins the current goroutine to P, disables preemption and returns poolLocal pool for the P.
. . 172:// Caller must call runtime_procUnpin() when done with the pool.
. . 173:func (p *Pool) pin() *poolLocal {
sync.Pool runs slowly under a high concurrency load. Try to allocate all the structures once during startup and reuse them many times. For example, you may create several goroutines (workers) at start time, instead of running a new goroutine on each request. I recommend reading this article: https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs .
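As a minimal sketch of that worker pattern (the aPrivateStruct type and the decode/encodeAndSend functions are hypothetical stand-ins for the question's real code), each worker allocates its read buffer and struct once at startup and reuses them for every message, so there is no sync.Pool and no per-message allocation:

package main

import "net"

// Hypothetical stand-ins for the question's real type and processing steps.
type aPrivateStruct struct{ /* decoded fields */ }

func decode(pkt []byte, rf *aPrivateStruct) { /* ... */ }
func encodeAndSend(rf *aPrivateStruct)      { /* ... */ }

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 9999})
	if err != nil {
		panic(err)
	}

	// A fixed set of workers created at start time. net.UDPConn is
	// safe for concurrent use, so every worker reads from the socket
	// directly; each one owns a single buffer and struct for its
	// whole lifetime instead of allocating per message.
	const numWorkers = 16
	for i := 0; i < numWorkers; i++ {
		go func() {
			buf := make([]byte, 65507) // max UDP payload (64KB minus headers)
			rf := new(aPrivateStruct)
			for {
				n, _, err := conn.ReadFromUDP(buf)
				if err != nil {
					return
				}
				decode(buf[:n], rf)
				encodeAndSend(rf)
			}
		}()
	}
	select {} // block main forever; workers do all the processing
}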
From https://golang.org/pkg/sync/#Pool :

"a free list maintained as part of a short-lived object is not a suitable use for a Pool, since the overhead does not amortize well in that scenario. It is more efficient to have such objects implement their own free list"
https://dave.cheney.net/2015/11/29/a-whirlwind-tour-of-gos-runtime-environment-variables
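That post covers the runtime's environment variables; the most relevant one here is GOGC, which controls how much the heap may grow over the live set before the next collection is triggered. Raising it trades RAM for fewer GC cycles, which a 64GB machine at 10% CPU can likely afford. A sketch, where the value 400 is purely illustrative and should be tuned against your own profile:

package main

import "runtime/debug"

func main() {
	// Equivalent to running the process with GOGC=400: the next GC
	// cycle starts only once the heap has grown 400% over the live
	// data from the previous cycle, so collections run less often
	// at the cost of more memory held.
	debug.SetGCPercent(400)
	// ... rest of the program
}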
http://golang-jp.org/doc/effective_go.html#leaky_buffer
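The "leaky buffer" from Effective Go is exactly such a hand-rolled free list: a buffered channel of buffers, where a get falls back to allocating and a put silently drops the buffer when the list is full. A sketch of the idea:

// A free list holding up to 100 recycled buffers.
var freeList = make(chan []byte, 100)

// getBuffer reuses a buffer from the free list if one is available,
// otherwise allocates a fresh one; it never blocks.
func getBuffer() []byte {
	select {
	case b := <-freeList:
		return b
	default:
		return make([]byte, 65507)
	}
}

// putBuffer returns the buffer to the free list, or drops it for the
// GC to reclaim when the list is already full; it never blocks.
func putBuffer(b []byte) {
	select {
	case freeList <- b:
	default:
	}
}

Because both paths are non-blocking select statements, the free list never becomes a contention point the way the pool's per-P mutexes did in the profile above.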