
In Go, do non-capturing closures harm performance?

For instance, github.com/yhat/scrape suggests using a closure like this:

func someFunc() {
    ...
    matcher := func(n *html.Node) bool {
        return n.DataAtom == atom.Body
    }
    body, ok := scrape.Find(root, matcher)
    ...
}

Since matcher doesn’t actually capture any local variables, this could equivalently be written as:

func someFunc() {
    ...
    body, ok := scrape.Find(root, matcher)
    ...
}

func matcher(n *html.Node) bool {
    return n.DataAtom == atom.Body
}

The first form looks better, because the matcher function is quite specific to that place in the code. But does it perform worse at runtime (assuming someFunc may be called often)?

I guess there must be some overhead to creating a closure, but can this kind of closure be optimized into a regular function by the compiler?

(Obviously the language spec doesn’t require this; I’m interested in what gc actually does.)

asked Aug 29 '17 by Vasiliy Faronov

2 Answers

It seems there is no difference; we can check the generated machine code.

Here is a toy program:

package main

import "fmt"

func topLevelFunction(x int) int {
    return x + 4
}

func useFunction(fn func(int) int) {
    fmt.Println(fn(10))
}

func invoke() {
    innerFunction := func(x int) int {
        return x + 8
    }
    useFunction(topLevelFunction)
    useFunction(innerFunction)
}

func main() {
    invoke()
}

And here is its disassembly:

$ go version
go version go1.8.5 linux/amd64

$ go tool objdump -s 'main\.(invoke|topLevel)' bin/toy 
TEXT main.topLevelFunction(SB) /home/vasiliy/cur/work/learn-go/src/my/toy/toy.go
    toy.go:6    0x47b7a0    488b442408  MOVQ 0x8(SP), AX    
    toy.go:6    0x47b7a5    4883c004    ADDQ $0x4, AX       
    toy.go:6    0x47b7a9    4889442410  MOVQ AX, 0x10(SP)   
    toy.go:6    0x47b7ae    c3      RET         

TEXT main.invoke(SB) /home/vasiliy/cur/work/learn-go/src/my/toy/toy.go
    toy.go:13   0x47b870    64488b0c25f8ffffff  FS MOVQ FS:0xfffffff8, CX       
    toy.go:13   0x47b879    483b6110        CMPQ 0x10(CX), SP           
    toy.go:13   0x47b87d    7638            JBE 0x47b8b7                
    toy.go:13   0x47b87f    4883ec10        SUBQ $0x10, SP              
    toy.go:13   0x47b883    48896c2408      MOVQ BP, 0x8(SP)            
    toy.go:13   0x47b888    488d6c2408      LEAQ 0x8(SP), BP            
    toy.go:17   0x47b88d    488d052cfb0200      LEAQ 0x2fb2c(IP), AX            
    toy.go:17   0x47b894    48890424        MOVQ AX, 0(SP)              
    toy.go:17   0x47b898    e813ffffff      CALL main.useFunction(SB)       
    toy.go:14   0x47b89d    488d0514fb0200      LEAQ 0x2fb14(IP), AX            
    toy.go:18   0x47b8a4    48890424        MOVQ AX, 0(SP)              
    toy.go:18   0x47b8a8    e803ffffff      CALL main.useFunction(SB)       
    toy.go:19   0x47b8ad    488b6c2408      MOVQ 0x8(SP), BP            
    toy.go:19   0x47b8b2    4883c410        ADDQ $0x10, SP              
    toy.go:19   0x47b8b6    c3          RET                 
    toy.go:13   0x47b8b7    e874f7fcff      CALL runtime.morestack_noctxt(SB)   
    toy.go:13   0x47b8bc    ebb2            JMP main.invoke(SB)         

TEXT main.invoke.func1(SB) /home/vasiliy/cur/work/learn-go/src/my/toy/toy.go
    toy.go:15   0x47b8f0    488b442408  MOVQ 0x8(SP), AX    
    toy.go:15   0x47b8f5    4883c008    ADDQ $0x8, AX       
    toy.go:15   0x47b8f9    4889442410  MOVQ AX, 0x10(SP)   
    toy.go:15   0x47b8fe    c3      RET         

As we can see, at least in this simple case, there is no structural difference in how topLevelFunction and innerFunction (invoke.func1) are translated to machine code, nor in how each is passed to useFunction.

(It is instructive to compare this to the case where innerFunction does capture a local variable; and to the case where, moreover, innerFunction is passed via a global variable rather than a function argument — but these are left as an exercise to the reader.)
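For instance, a capturing variant that one could disassemble the same way might look like this (a sketch only; invokeCapturing is a name I made up, and it reuses useFunction from the toy program above):

func invokeCapturing() {
    y := 8
    // Because the closure captures y, the compiler can no longer point at a
    // static funcval; it has to materialize a closure value at run time.
    innerFunction := func(x int) int {
        return x + y
    }
    useFunction(innerFunction)
}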

answered by Vasiliy Faronov

It generally should, yes. And probably even more so once compiler optimizations are taken into account, since reasoning about a plain function is generally easier than reasoning about a closure, so I would expect the compiler to optimize a function more often than an equivalent closure. But it is not exactly black and white: many factors can affect the final code produced, including your platform and the version of the compiler itself. More importantly, the rest of your code will typically affect performance far more than the cost of making a call (both algorithm-wise and lines-of-code-wise), which seems to be the point JimB made.

For example, I wrote the following sample code and benchmarked it.

package main

// test is a package-level counter shared by the function and the closures,
// so every variant except closureTestLocal goes through memory to update it.
var (
    test int64
)

const (
    testThreshold = int64(1000000000)
)

func someFunc() {
    test += 1
}

func funcTest(threshold int64) int64 {
    test = 0
    for i := int64(0); i < threshold; i++ {
        someFunc()
    }
    return test
}

func closureTest(threshold int64) int64 {
    someClosure := func() {
        test += 1
    }

    test = 0
    for i := int64(0); i < threshold; i++ {
        someClosure()
    }
    return test
}

func closureTestLocal(threshold int64) int64 {
    var localTest int64
    localClosure := func() {
        localTest += 1
    }

    localTest = 0
    for i := int64(0); i < threshold; i++ {
        localClosure()
    }
    return localTest
}
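For reference, a minimal harness along these lines could drive those functions with the standard testing package (a sketch; the benchmark names and the _test.go file are mine, not part of the original measurement):

// closure_bench_test.go
package main

import "testing"

// Each benchmark passes b.N as the threshold, so the reported ns/op is
// roughly the cost of one inner iteration (one call plus the increment).
func BenchmarkFunc(b *testing.B) {
    funcTest(int64(b.N))
}

func BenchmarkClosure(b *testing.B) {
    closureTest(int64(b.N))
}

func BenchmarkClosureLocal(b *testing.B) {
    closureTestLocal(int64(b.N))
}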

On my laptop, funcTest takes 2.0 ns per iteration, closureTest takes 2.2 ns, and closureTestLocal takes 1.9 ns. The closureTest vs funcTest result appears to confirm your (and my) assumption that a closure call is slower than a function call. But note that these test functions were intentionally kept small and simple so that the call-speed difference would stand out, and it is still only a 10% difference. In fact, checking the compiler output shows that in the funcTest case the compiler inlined someFunc into the loop rather than calling it, so I would expect the difference to be even smaller otherwise.

More importantly, closureTestLocal is 5% faster than the (inlined) function version, even though it is actually a capturing closure. Note that neither of the closures was inlined or optimized away: both closure tests faithfully make all the calls. The only relevant difference I see in the compiled code is that the local-closure case operates entirely on the stack, while the other two access a global variable (somewhere in memory) by its address. So while I can easily explain the difference by looking at the compiled code, my point is that it is not exactly black and white even in the simplest cases.

So, if speed really matters in your case, I would suggest benchmarking it instead (and with your actual code). You can also use go tool objdump to analyze the code actually produced and get a clue where the difference comes from. But as a rule of thumb, I would suggest focusing on writing better code (whatever that means for you) and ignoring the cost of the calls themselves (as in "avoid premature optimization").
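Concretely, something like this (the binary name and paths are placeholders):

$ go test -bench=. -benchmem          # run the benchmarks above and report ns/op
$ go build -gcflags='-m' ./...        # print the compiler's inlining and escape-analysis decisions
$ go tool objdump -s 'main\.' ./toy   # disassemble the resulting binary, as in the first answer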

answered by Seva