Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient number of goroutines on this machine

So I do have a concurrent quicksort implementation written by me. It looks like this:

func Partition(A []int, p int, r int) int {
    index := MedianOf3(A, p, r)
    swapArray(A, index, r)
    x := A[r]
    j := p - 1
    i := p
    for i < r {
        if A[i] <= x {
            j++
            tmp := A[j]
            A[j] = A[i]
            A[i] = tmp
        }
        i++
    }
    swapArray(A, j+1, r)
    return j + 1
}


func ConcurrentQuicksort(A []int, p int, r int) {
    wg := sync.WaitGroup{}
    if p < r {
        q := Partition(A, p, r)
        select {
        case sem <- true:
            wg.Add(1)
            go func() {
                ConcurrentQuicksort(A, p, q-1)
                <-sem
                wg.Done()
            }()
        default:
            Quicksort(A, p, q-1)
        }
        select {
        case sem <- true:
            wg.Add(1)
            go func() {
                ConcurrentQuicksort(A, q+1, r)
                <-sem
                wg.Done()
            }()
        default:
            Quicksort(A, q+1, r)
        }
    }
    wg.Wait()
}

func Quicksort(A []int, p int, r int) {
    if p < r {
        q := Partition(A, p, r)
        Quicksort(A, p, q-1)
        Quicksort(A, q+1, r)
    }
}

I have a sem buffered channel, which I use to limit the number of goroutines running (if its reaches that number, I dont set up another goroutine, I just do the normal quicksort on the subarray). First I started with 100, then I've changed to 50, 20. The benchmarks would get slightly better. But after switching to 10, it started to go back, times started to get bigger. So there is some arbitrary number, at least for my hardware, that makes the algorithm run most efficient.

When I was implementing this, I actually saw some SO question about the number of goroutines that would be the best and now I cannot find it (stupid Chrome history actually saves not all visited sites). Do you know how to calculate such a things? And it would be the best if I didn't have to hardcode it, just let the program do it itself.

P.S I have nonconcurrent Quicksort, which runs about 1.7x slower than this. As you can see in my code, I do Quicksort, when the number of running goroutines exceeds the number I've set up earlier. I thought what about using a ConcurrentQuicksort, but not calling it with go keyword, just simply calling it, and maybe if other goroutines finish their job, the ConcurrentQuicksort which I called would start to launch up goroutines, speeding up the process (cuz as you can see Quicksort would only launch recursive quicksorts, without goroutines). I did that, and actually the time was like 10% slower than the regular Quicksort. Do you know why would that happen?

like image 461
minecraftplayer1234 Avatar asked Jan 30 '23 03:01

minecraftplayer1234


1 Answers

You have to experiment a bit with this stuff, but I don't think the main concern is goroutines running at once. As the answer @reticentroot linked to says, it's not necessarily a problem to run a lot of simultaneous goroutines.

I think your main concern should be total number of goroutine launches. The current implementation could theoretically start a goroutine to sort just a few items, and that goroutine would spend a lot more time on startup/coordination than actual sorting.

The ideal is you only start as many goroutines as you need to get good utilization of all your CPUs. If your work items are ~equal size and your cores are ~equally busy, then starting one task per core is perfect.

Here, tasks aren't evenly sized, so you might split the sort into somewhat more tasks than you have CPUs and distribute them. (In production you would typically use a worker pool to distribute work without starting a new goroutine for every task, but I think we can get away with skipping that here.)

To get a workable number of tasks--enough to keep all cores busy, but not so many that you create lots of overhead--you can set a minimum size (initial array size/100 or whatever), and only split off sorts of arrays larger than that.


In slightly more detail, there is a bit of cost every time you send a task off to the background. For starters:

  • Each goroutine launch spends a little time setting up the stack and doing scheduler bookkeeping
  • Each task switch spends some time in the scheduler and may incur cache misses when the two goroutines are looking at different code or data
  • Your own coordination code (channel sends and sync ops) takes time

Other things can prevent ideal speedups from happening: you could hit a systemwide limit on e.g. memory bandwidth as Volker pointed out, some sync costs can increase as you add cores, and you can run into various trickier issues sometimes. But the setup, switching, and coordination costs are a good place to start.

The benefit that can outweigh the coordination costs is, of course, other CPUs getting work done when they'd otherwise sit idle.

I think, but haven't tested, that your problems at 50 goroutines are 1) you already reached nearly-full utilization long ago, so adding more tasks adds more coordination work without making things go faster, and 2) you're creating goroutines for tiny sorts, which may spend more of their time setting up and coordinating than they actually do sorting. And at 10 goroutines your problem might be that you're no longer achieving full CPU utilization.

If you wanted, you could test those theories by counting the number of total goroutine launches at various goroutine limits (in an atomic global counter) and measuring CPU utilization at various limits (e.g. by running your program under the Linux/UNIX time utility).

The approach I'd suggest for a divide-and-conquer problem like this is only fork off a goroutine for large enough subproblems (for quicksort, that means large enough subarrays). You can try different limits: maybe you only start goroutines for pieces that are more than 1/64th of the original array, or pieces above some static threshold like 1000 items.


And you meant this sort routine as an exercise, I suspect, but there are various things you can do to make your sorts faster or more robust against weird inputs. The standard libary sort falls back to insertion sort for small subarrays and uses heapsort for the unusual data patterns that cause quicksort problems.

You can also look at other algorithms like radix sort for all or part of the sorting, which I played with. That sorting library is also parallel. I wound up using a minimum cutoff of 127 items before I'd hand a subarray off for other goroutines to sort, and I used an arrangement with a fixed pool of goroutines and a buffered chan to pass tasks between them. That produced decent practical speedups at the time, though it was likely not the best approach at the time and I'm almost sure it's not on today's Go scheduler. Experimentation is fun!

like image 175
twotwotwo Avatar answered Feb 02 '23 08:02

twotwotwo