I've got a bit of Go code that I've been tinkering with to answer a little curiosity of mine related to a video game my brother-in-law plays.
Essentially, the code below simulates interactions with monsters in the game and how often he can expect them to drop items upon their defeat. The problem I'm having is that I would expect a piece of code like this to be perfect for parallelization, but when I add in concurrency the time it takes to do all of the simulations tends to slow down by 4-6 times the original without concurrency.
To give you a better understanding of how the code works, I have three main functions: The interaction function which is a simple interaction between the player and a monster. It returns 1 if the monster drops an item, and 0 otherwise. The simulation function runs several interactions and returns a slice of interaction results (i.e., 1's and 0's representing successful/unsuccessful interactions). Finally, there is the test function which runs a set of simulations and returns a slice of simulation results which are the total number of interactions that resulted in a dropped item. It's the last function which I am trying to run in parallel.
Now, I could understand why the code would slow down if I created a goroutine for each test that I want to run. Assuming I'm running 100 tests, the context switching between each of the goroutines across the 4 CPUs my MacBook Air has would kill the performance, but I'm only creating as many goroutines as I have processors and dividing the number of tests between the goroutines. I would expect this to actually speed up the code's performance since I am running each of my tests in parallel, but, of course, I'm getting a major slow down instead.
I'd love to figure out why this is happening, so any help would be greatly appreciated.
Below is the regular code without the go routines:
package main import ( "fmt" "math/rand" "time" ) const ( NUMBER_OF_SIMULATIONS = 1000 NUMBER_OF_INTERACTIONS = 1000000 DROP_RATE = 0.0003 ) /** * Simulates a single interaction with a monster * * Returns 1 if the monster dropped an item and 0 otherwise */ func interaction() int { if rand.Float64() <= DROP_RATE { return 1 } return 0 } /** * Runs several interactions and retuns a slice representing the results */ func simulation(n int) []int { interactions := make([]int, n) for i := range interactions { interactions[i] = interaction() } return interactions } /** * Runs several simulations and returns the results */ func test(n int) []int { simulations := make([]int, n) for i := range simulations { successes := 0 for _, v := range simulation(NUMBER_OF_INTERACTIONS) { successes += v } simulations[i] = successes } return simulations } func main() { rand.Seed(time.Now().UnixNano()) fmt.Println("Successful interactions: ", test(NUMBER_OF_SIMULATIONS)) }
And, here is the concurrent code with the goroutines:
package main import ( "fmt" "math/rand" "time" "runtime" ) const ( NUMBER_OF_SIMULATIONS = 1000 NUMBER_OF_INTERACTIONS = 1000000 DROP_RATE = 0.0003 ) /** * Simulates a single interaction with a monster * * Returns 1 if the monster dropped an item and 0 otherwise */ func interaction() int { if rand.Float64() <= DROP_RATE { return 1 } return 0 } /** * Runs several interactions and retuns a slice representing the results */ func simulation(n int) []int { interactions := make([]int, n) for i := range interactions { interactions[i] = interaction() } return interactions } /** * Runs several simulations and returns the results */ func test(n int, c chan []int) { simulations := make([]int, n) for i := range simulations { for _, v := range simulation(NUMBER_OF_INTERACTIONS) { simulations[i] += v } } c <- simulations } func main() { rand.Seed(time.Now().UnixNano()) nCPU := runtime.NumCPU() runtime.GOMAXPROCS(nCPU) fmt.Println("Number of CPUs: ", nCPU) tests := make([]chan []int, nCPU) for i := range tests { c := make(chan []int) go test(NUMBER_OF_SIMULATIONS/nCPU, c) tests[i] = c } // Concatentate the test results results := make([]int, NUMBER_OF_SIMULATIONS) for i, c := range tests { start := (NUMBER_OF_SIMULATIONS/nCPU) * i stop := (NUMBER_OF_SIMULATIONS/nCPU) * (i+1) copy(results[start:stop], <-c) } fmt.Println("Successful interactions: ", results) }
UPDATE (01/12/13 18:05)
I've added a new version of the concurrent code below that creates a new Rand instance for each goroutine per "the system"'s suggestion below. I'm now seeing a very slight speed up compared to the serial version of the code (around a 15-20% reduction in overall time taken). I'd love to know why I don't see something closer to a 75% reduction in time since I'm spreading the workload over my MBA's 4 cores. Does anyone have any further suggestions that could help out?
package main import ( "fmt" "math/rand" "time" "runtime" ) const ( NUMBER_OF_SIMULATIONS = 1000 NUMBER_OF_INTERACTIONS = 1000000 DROP_RATE = 0.0003 ) /** * Simulates a single interaction with a monster * * Returns 1 if the monster dropped an item and 0 otherwise */ func interaction(generator *rand.Rand) int { if generator.Float64() <= DROP_RATE { return 1 } return 0 } /** * Runs several interactions and retuns a slice representing the results */ func simulation(n int, generator *rand.Rand) []int { interactions := make([]int, n) for i := range interactions { interactions[i] = interaction(generator) } return interactions } /** * Runs several simulations and returns the results */ func test(n int, c chan []int) { source := rand.NewSource(time.Now().UnixNano()) generator := rand.New(source) simulations := make([]int, n) for i := range simulations { for _, v := range simulation(NUMBER_OF_INTERACTIONS, generator) { simulations[i] += v } } c <- simulations } func main() { rand.Seed(time.Now().UnixNano()) nCPU := runtime.NumCPU() runtime.GOMAXPROCS(nCPU) fmt.Println("Number of CPUs: ", nCPU) tests := make([]chan []int, nCPU) for i := range tests { c := make(chan []int) go test(NUMBER_OF_SIMULATIONS/nCPU, c) tests[i] = c } // Concatentate the test results results := make([]int, NUMBER_OF_SIMULATIONS) for i, c := range tests { start := (NUMBER_OF_SIMULATIONS/nCPU) * i stop := (NUMBER_OF_SIMULATIONS/nCPU) * (i+1) copy(results[start:stop], <-c) } fmt.Println("Successful interactions: ", results) }
UPDATE (01/13/13 17:58)
Thanks everyone for the help in figuring out my problem. I did finally get the answer I was looking for and so I thought I would just summarize here for anyone who has the same problem.
Essentially I had two main issues: first, even though my code was embarrassingly parallel, it was running slower when I split it up amongst the available processors, and second, the solution opened up another issue, which was my serial code was running twice as slow as the concurrent code running on single processor, which you would expect to be roughly the same . In both cases the issue was the random number generator function rand.Float64
. Basically, this is a convenience function provided by the rand
package. In that package, a global instance of the Rand
struct is created and used by each of the convenience functions. This global Rand
instance has a mutex lock associated with it. Since I was using this convenience function, I wasn't truly able to parallelize my code since each of the goroutines would have to line up for access to the global Rand
instance. The solution (as "the system" suggests below) is to create a separate instance of the Rand
struct for each goroutine. This solved the first problem but created the second one.
The second problem was that my non-parallel concurrent code (i.e., my concurrent code running with only a single processor) was running twice as fast as the sequential code. The reason for this was that, even though I was only running with a single processor and a single goroutine, that goroutine had its own instance of the Rand
struct that I had created, and I had created it without the mutex lock. The sequential code was still using the rand.Float64
convenience function which made use of the global mutex protected Rand
instance. The cost of acquiring that lock was causing the sequential code to run twice as slow.
So, the moral of the story is, whenever performance matters, make sure you create an instance of the Rand
struct and call the function you need off of it rather than using the convenience functions provided by the package.
In Go, concurrency works through the use of in-built functions known as Goroutines. Goroutines are functions, unique to Go, that run at the same time alongside other code or programs. They're not OS threads, though, they may be considered lightweight threads. Goroutines are deeply integrated with Go's runtime.
Concurrency in Golang is cheap and easy. Goroutines are cheap, lightweight threads. Channels, are the conduits that allow for communication between goroutines. Communicating Sequential Processes, or CSP for short, is used to describe how systems that feature multiple concurrent models should interact with one another.
Concurrency is the task of running and managing the multiple computations at the same time. While parallelism is the task of running multiple computations simultaneously.
You create Goroutines to share data via channels. And because there is no need for exclusive access to global data structures, you gain speed.
The issue seems to come from your use of rand.Float64()
, which uses a shared global object with a Mutex lock on it.
Instead, if for each CPU you create a separate rand.New()
, pass it through to the interactions()
, and use it to create the Float64()
, there's a massive improvement.
Update to show the changes to the new example code in the question that now uses rand.New()
The test()
function was modified to either use a given channel, or return the result.
func test(n int, c chan []int) []int { source := rand.NewSource(time.Now().UnixNano()) generator := rand.New(source) simulations := make([]int, n) for i := range simulations { for _, v := range simulation(NUMBER_OF_INTERACTIONS, generator) { simulations[i] += v } } if c == nil { return simulations } c <- simulations return nil }
The main()
function was updated to run both tests, and output the timed result.
func main() { rand.Seed(time.Now().UnixNano()) nCPU := runtime.NumCPU() runtime.GOMAXPROCS(nCPU) fmt.Println("Number of CPUs: ", nCPU) start := time.Now() fmt.Println("Successful interactions: ", len(test(NUMBER_OF_SIMULATIONS, nil))) fmt.Println(time.Since(start)) start = time.Now() tests := make([]chan []int, nCPU) for i := range tests { c := make(chan []int) go test(NUMBER_OF_SIMULATIONS/nCPU, c) tests[i] = c } // Concatentate the test results results := make([]int, NUMBER_OF_SIMULATIONS) for i, c := range tests { start := (NUMBER_OF_SIMULATIONS/nCPU) * i stop := (NUMBER_OF_SIMULATIONS/nCPU) * (i+1) copy(results[start:stop], <-c) } fmt.Println("Successful interactions: ", len(results)) fmt.Println(time.Since(start)) }
The output is I received:
> Number of CPUs: 2 > > Successful interactions: 1000 > 1m20.39959s > > Successful interactions: 1000 > 41.392299s
Testing your code on my Linux quad core i7 laptop I get this
Here is a Google Spreadsheet
This shows that under Linux at least the scaling is very nearly linear per core.
I think there may be two reasons why you aren't seeing this.
The first is that your macbook air only has 2 real cores. It has 4 hyperthreads though which is why it reports 4 as max cpus. A hyperthread typically only gives an extra 15% performance over a single core rather than the 100% you might expect. So stick to benchmarking 1 or 2 CPUs only on the macbook air!
The other reason might be OS X thread performance compared to Linux. They use different threading models which may be affecting performance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With