I was doing some experiments in Go and I found something really odd. When I run the following code on my computer it executes in ~0.5 seconds.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func waitAround(die chan bool) {
	<-die
}

func main() {
	var startMemory runtime.MemStats
	runtime.ReadMemStats(&startMemory)
	start := time.Now()

	cpus := runtime.NumCPU()
	runtime.GOMAXPROCS(cpus)

	die := make(chan bool)
	count := 100000
	for i := 0; i < count; i++ {
		go waitAround(die)
	}

	elapsed := time.Since(start)
	var endMemory runtime.MemStats
	runtime.ReadMemStats(&endMemory)

	fmt.Printf("Started %d goroutines\n%d CPUs\n%f seconds\n",
		count, cpus, elapsed.Seconds())
	fmt.Printf("Memory before %d\nmemory after %d\n", startMemory.Alloc,
		endMemory.Alloc)
	fmt.Printf("%d goroutines running\n", runtime.NumGoroutine())
	fmt.Printf("%d bytes per goroutine\n", (endMemory.Alloc-startMemory.Alloc)/uint64(runtime.NumGoroutine()))
	close(die)
}
However, when I execute it with runtime.GOMAXPROCS(1), it runs much faster (~0.15 seconds). Can anybody explain why running many goroutines would be slower when using more cores? Is there significant overhead in multiplexing the goroutines onto multiple cores? I realize the goroutines aren't doing anything, and it would probably be a different story if they had real work to wait on.
When running on a single core, goroutine allocation and switching are just a matter of internal accounting. Goroutines are never preempted, so the switching logic is extremely simple and very fast. More importantly in this case, your main routine never yields, so the goroutines never even begin execution before they're terminated. You allocate the structure and then delete it, and that's that. (Edit: this may not be true with newer versions of Go, but it is certainly more orderly with only one OS thread.)
But when you map goroutines over multiple threads, you suddenly get OS-level context switching involved, which is orders of magnitude slower and more complex. Even though you're on multiple cores, there's a lot more work to be done. Plus, now your goroutines may actually be running before the program terminates.
Try strace-ing the program under both conditions and see how its behavior differs.
It is always difficult to measure performance across multiple cores unless the workload is large enough to benefit from parallelism. The problem is that the work has to be distributed among threads and cores, and while that overhead may not be huge, it is still significant, especially for trivial code, and it lowers overall performance.
And like you mentioned it would be a completely different story if you did something CPU intensive.