Why does adding concurrency slow down this golang code?

Tags:

go

I've got a bit of Go code that I've been tinkering with to answer a little curiosity of mine related to a video game my brother-in-law plays.

Essentially, the code below simulates interactions with monsters in the game and how often he can expect them to drop items upon their defeat. The problem I'm having is that I would expect a piece of code like this to be perfect for parallelization, but when I add in concurrency the time it takes to do all of the simulations tends to slow down by 4-6 times the original without concurrency.

To give you a better understanding of how the code works, I have three main functions: The interaction function which is a simple interaction between the player and a monster. It returns 1 if the monster drops an item, and 0 otherwise. The simulation function runs several interactions and returns a slice of interaction results (i.e., 1's and 0's representing successful/unsuccessful interactions). Finally, there is the test function which runs a set of simulations and returns a slice of simulation results which are the total number of interactions that resulted in a dropped item. It's the last function which I am trying to run in parallel.

Now, I could understand why the code would slow down if I created a goroutine for each test that I want to run. Assuming I'm running 100 tests, the context switching between each of the goroutines across the 4 CPUs my MacBook Air has would kill the performance, but I'm only creating as many goroutines as I have processors and dividing the number of tests between the goroutines. I would expect this to actually speed up the code's performance since I am running each of my tests in parallel, but, of course, I'm getting a major slow down instead.

I'd love to figure out why this is happening, so any help would be greatly appreciated.

Below is the regular code without the go routines:

package main  import (     "fmt"     "math/rand"     "time" )  const (     NUMBER_OF_SIMULATIONS = 1000     NUMBER_OF_INTERACTIONS = 1000000     DROP_RATE = 0.0003 )  /**  * Simulates a single interaction with a monster  *  * Returns 1 if the monster dropped an item and 0 otherwise  */ func interaction() int {     if rand.Float64() <= DROP_RATE {         return 1     }     return 0 }  /**  * Runs several interactions and retuns a slice representing the results  */ func simulation(n int) []int {     interactions := make([]int, n)     for i := range interactions {         interactions[i] = interaction()     }     return interactions }  /**  * Runs several simulations and returns the results  */ func test(n int) []int {     simulations := make([]int, n)     for i := range simulations {         successes := 0         for _, v := range simulation(NUMBER_OF_INTERACTIONS) {             successes += v         }         simulations[i] = successes     }     return simulations }  func main() {     rand.Seed(time.Now().UnixNano())     fmt.Println("Successful interactions: ", test(NUMBER_OF_SIMULATIONS)) }

And, here is the concurrent code with the goroutines:

package main  import (     "fmt"     "math/rand"     "time"     "runtime" )  const (     NUMBER_OF_SIMULATIONS = 1000     NUMBER_OF_INTERACTIONS = 1000000     DROP_RATE = 0.0003 )  /**  * Simulates a single interaction with a monster  *  * Returns 1 if the monster dropped an item and 0 otherwise  */ func interaction() int {     if rand.Float64() <= DROP_RATE {         return 1     }     return 0 }  /**  * Runs several interactions and retuns a slice representing the results  */ func simulation(n int) []int {     interactions := make([]int, n)     for i := range interactions {         interactions[i] = interaction()     }     return interactions }  /**  * Runs several simulations and returns the results  */ func test(n int, c chan []int) {     simulations := make([]int, n)     for i := range simulations {         for _, v := range simulation(NUMBER_OF_INTERACTIONS) {             simulations[i] += v         }     }     c <- simulations }  func main() {     rand.Seed(time.Now().UnixNano())      nCPU := runtime.NumCPU()     runtime.GOMAXPROCS(nCPU)     fmt.Println("Number of CPUs: ", nCPU)      tests := make([]chan []int, nCPU)     for i := range tests {         c := make(chan []int)         go test(NUMBER_OF_SIMULATIONS/nCPU, c)         tests[i] = c     }      // Concatentate the test results     results := make([]int, NUMBER_OF_SIMULATIONS)     for i, c := range tests {         start := (NUMBER_OF_SIMULATIONS/nCPU) * i         stop := (NUMBER_OF_SIMULATIONS/nCPU) * (i+1)         copy(results[start:stop], <-c)     }      fmt.Println("Successful interactions: ", results) }

UPDATE (01/12/13 18:05)

I've added a new version of the concurrent code below that creates a new Rand instance for each goroutine per "the system"'s suggestion below. I'm now seeing a very slight speed up compared to the serial version of the code (around a 15-20% reduction in overall time taken). I'd love to know why I don't see something closer to a 75% reduction in time since I'm spreading the workload over my MBA's 4 cores. Does anyone have any further suggestions that could help out?

package main  import (     "fmt"     "math/rand"     "time"     "runtime" )  const (     NUMBER_OF_SIMULATIONS = 1000     NUMBER_OF_INTERACTIONS = 1000000     DROP_RATE = 0.0003 )  /**  * Simulates a single interaction with a monster  *  * Returns 1 if the monster dropped an item and 0 otherwise  */ func interaction(generator *rand.Rand) int {     if generator.Float64() <= DROP_RATE {         return 1     }     return 0 }  /**  * Runs several interactions and retuns a slice representing the results  */ func simulation(n int, generator *rand.Rand) []int {     interactions := make([]int, n)     for i := range interactions {         interactions[i] = interaction(generator)     }     return interactions }  /**  * Runs several simulations and returns the results  */ func test(n int, c chan []int) {     source := rand.NewSource(time.Now().UnixNano())     generator := rand.New(source)     simulations := make([]int, n)     for i := range simulations {         for _, v := range simulation(NUMBER_OF_INTERACTIONS, generator) {             simulations[i] += v         }     }     c <- simulations }  func main() {     rand.Seed(time.Now().UnixNano())      nCPU := runtime.NumCPU()     runtime.GOMAXPROCS(nCPU)     fmt.Println("Number of CPUs: ", nCPU)      tests := make([]chan []int, nCPU)     for i := range tests {         c := make(chan []int)         go test(NUMBER_OF_SIMULATIONS/nCPU, c)         tests[i] = c     }      // Concatentate the test results     results := make([]int, NUMBER_OF_SIMULATIONS)     for i, c := range tests {         start := (NUMBER_OF_SIMULATIONS/nCPU) * i         stop := (NUMBER_OF_SIMULATIONS/nCPU) * (i+1)         copy(results[start:stop], <-c)     }      fmt.Println("Successful interactions: ", results) }

UPDATE (01/13/13 17:58)

Thanks everyone for the help in figuring out my problem. I did finally get the answer I was looking for and so I thought I would just summarize here for anyone who has the same problem.

Essentially I had two main issues: first, even though my code was embarrassingly parallel, it was running slower when I split it up amongst the available processors, and second, the solution opened up another issue, which was my serial code was running twice as slow as the concurrent code running on single processor, which you would expect to be roughly the same . In both cases the issue was the random number generator function rand.Float64. Basically, this is a convenience function provided by the rand package. In that package, a global instance of the Rand struct is created and used by each of the convenience functions. This global Rand instance has a mutex lock associated with it. Since I was using this convenience function, I wasn't truly able to parallelize my code since each of the goroutines would have to line up for access to the global Rand instance. The solution (as "the system" suggests below) is to create a separate instance of the Rand struct for each goroutine. This solved the first problem but created the second one.

The second problem was that my non-parallel concurrent code (i.e., my concurrent code running with only a single processor) was running twice as fast as the sequential code. The reason for this was that, even though I was only running with a single processor and a single goroutine, that goroutine had its own instance of the Rand struct that I had created, and I had created it without the mutex lock. The sequential code was still using the rand.Float64 convenience function which made use of the global mutex protected Rand instance. The cost of acquiring that lock was causing the sequential code to run twice as slow.

So, the moral of the story is, whenever performance matters, make sure you create an instance of the Rand struct and call the function you need off of it rather than using the convenience functions provided by the package.

985

asked Jan 12 '13 22:01

Christopher Roach

2 Answers

The issue seems to come from your use of rand.Float64(), which uses a shared global object with a Mutex lock on it.

Instead, if for each CPU you create a separate rand.New(), pass it through to the interactions(), and use it to create the Float64(), there's a massive improvement.

Update to show the changes to the new example code in the question that now uses rand.New()

The test() function was modified to either use a given channel, or return the result.

func test(n int, c chan []int) []int {     source := rand.NewSource(time.Now().UnixNano())     generator := rand.New(source)     simulations := make([]int, n)     for i := range simulations {         for _, v := range simulation(NUMBER_OF_INTERACTIONS, generator) {             simulations[i] += v         }        }        if c == nil {         return simulations     }        c <- simulations     return nil  }

The main() function was updated to run both tests, and output the timed result.

func main() {     rand.Seed(time.Now().UnixNano())      nCPU := runtime.NumCPU()     runtime.GOMAXPROCS(nCPU)     fmt.Println("Number of CPUs: ", nCPU)      start := time.Now()     fmt.Println("Successful interactions: ", len(test(NUMBER_OF_SIMULATIONS, nil)))     fmt.Println(time.Since(start))      start = time.Now()     tests := make([]chan []int, nCPU)     for i := range tests {         c := make(chan []int)         go test(NUMBER_OF_SIMULATIONS/nCPU, c)         tests[i] = c     }      // Concatentate the test results     results := make([]int, NUMBER_OF_SIMULATIONS)     for i, c := range tests {         start := (NUMBER_OF_SIMULATIONS/nCPU) * i         stop := (NUMBER_OF_SIMULATIONS/nCPU) * (i+1)         copy(results[start:stop], <-c)     }     fmt.Println("Successful interactions: ", len(results))     fmt.Println(time.Since(start)) }

The output is I received:

 > Number of CPUs:  2  > > Successful interactions:  1000  > 1m20.39959s > > Successful interactions:  1000 > 41.392299s

129

answered Oct 06 '22 21:10

the system

Testing your code on my Linux quad core i7 laptop I get this

Here is a Google Spreadsheet

Screenshot of Google Spreadsheet

This shows that under Linux at least the scaling is very nearly linear per core.

I think there may be two reasons why you aren't seeing this.

The first is that your macbook air only has 2 real cores. It has 4 hyperthreads though which is why it reports 4 as max cpus. A hyperthread typically only gives an extra 15% performance over a single core rather than the 100% you might expect. So stick to benchmarking 1 or 2 CPUs only on the macbook air!

The other reason might be OS X thread performance compared to Linux. They use different threading models which may be affecting performance.

answered Oct 06 '22 21:10

Nick Craig-Wood

Related questions
                            
                                Programmatically determine which Java thread holds a lock
                            
                                Task.WhenAll result ordering
                            
                                How does one start a thread in Clojure?
                            
                                Fast and Best Producer/consumer queue technique BlockingCollection vs concurrent Queue [closed]
                            
                                quartz: preventing concurrent instances of a job in jobs.xml
                            
                                How can weakCompareAndSet fail spuriously if it is implemented exactly like compareAndSet?
                            
                                How to reduce java concurrent mode failure and excessive gc
                            
                                What exactly does the final keyword guarantee regarding concurrency?
                            
                                How to atomically update a maximum value?
                            
                                When should one use the Actor model?
                            
                                HandlerThread vs Executor - When is one more appropriate over the other?
                            
                                Memory effects of synchronization in Java
                            
                                How to properly catch RuntimeExceptions from Executors?
                            
                                Explain "Leader/Follower" Pattern
                            
                                ExecutorCompletionService? Why do need one if we have invokeAll?
                            
                                Java threads and number of cores
                            
                                Replacing ASP.Net's session entirely
                            
                                @GuardedBy annotation with java.util.concurrent.locks.ReadWriteLock
                            
                                Why don't I see pipe operators in most high-level languages?
                            
                                Ways to improve performance consistency

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With