I've been playing around with using assembly language in Go and I've written a Hamming Weight function as an excercise.
I've based a native Go version on this SO answer and the assembly version is based on this doc from AMD (page 180). Upon benchmarking the two functions, I find that the native Go version is around 1.5x - 2x faster than the assembly version, despite the hand-written assembly version being almost identical to the output from go tool 6g -S popcount.go
.
output from go test -bench=.
PASS
BenchmarkPopCount 100000000 19.4 ns/op
BenchmarkPopCount_g 200000000 8.97 ns/op
ok popcount 4.777s
popcount.go
package popcount
func popCount(i uint32) uint32 // Defined in popcount_amd64.s
func popCount_g(i uint32) uint32 {
i = i - ((i >> 1) & 0x55555555)
i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24
}
popcount_test.go
package popcount
import "testing"
func TestPopcount(t *testing.T) {
for i := uint32(0); i < uint32(100); i++ {
if popCount(i) != popCount_g(i) {
t.Fatalf("failed on input = %v", i)
}
}
}
func BenchmarkPopCount(b *testing.B) {
for i := 0; i < b.N; i++ {
popCount(uint32(i))
}
}
func BenchmarkPopCount_g(b *testing.B) {
for i := 0; i < b.N; i++ {
popCount_g(uint32(i))
}
}
popcount_amd64.s
// func popCount(i uint32) uint32
TEXT ·popCount(SB),$0
MOVL i+0(FP), BP // i
MOVL BP, BX // i
SHRL $1, BX // i >> 1
ANDL $0x055555555, BX // (i >> 1) & 0x55555555
SUBL BX, BP // w = i - ((i >> 1) & 0x55555555)
MOVL BP, AX // w
SHRL $2, BP // w >> 2
ANDL $0x033333333, AX // w & 0x33333333
ANDL $0x033333333, BP // (w >> 2) & 0x33333333
ADDL BP, AX // x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
MOVL AX, BX // x
SHRL $4, BX // x >> 4
ADDL AX, BX // x + (x >> 4)
ANDL $0x00F0F0F0F, BX // y = (x + (x >> 4) & 0x0F0F0F0F)
IMULL $0x001010101, BX // y * 0x01010101
SHRL $24, BX // population count = (y * 0x01010101) >> 24
MOVL BX, toReturn+8(FP) // Store result.
RET // return
output from go tool 6g -S popcount.go
"".popCount_g t=1 size=64 value=0 args=0x10 locals=0
000000 00000 (popcount.go:5) TEXT "".popCount_g+0(SB),4,$0-16
000000 00000 (popcount.go:5) NOP ,
000000 00000 (popcount.go:5) NOP ,
000000 00000 (popcount.go:5) MOVL "".i+8(FP),BP
0x0004 00004 (popcount.go:5) FUNCDATA $2,gclocals┬À9308e7ef08d2cc2f72ae1228688dacf9+0(SB)
0x0004 00004 (popcount.go:5) FUNCDATA $3,gclocals┬À3280bececceccd33cb74587feedb1f9f+0(SB)
0x0004 00004 (popcount.go:6) MOVL BP,BX
0x0006 00006 (popcount.go:6) SHRL $1,BX
0x0008 00008 (popcount.go:6) ANDL $1431655765,BX
0x000e 00014 (popcount.go:6) SUBL BX,BP
0x0010 00016 (popcount.go:7) MOVL BP,AX
0x0012 00018 (popcount.go:7) ANDL $858993459,AX
0x0017 00023 (popcount.go:7) SHRL $2,BP
0x001a 00026 (popcount.go:7) ANDL $858993459,BP
0x0020 00032 (popcount.go:7) ADDL BP,AX
0x0022 00034 (popcount.go:8) MOVL AX,BX
0x0024 00036 (popcount.go:8) SHRL $4,BX
0x0027 00039 (popcount.go:8) ADDL AX,BX
0x0029 00041 (popcount.go:8) ANDL $252645135,BX
0x002f 00047 (popcount.go:8) IMULL $16843009,BX
0x0035 00053 (popcount.go:8) SHRL $24,BX
0x0038 00056 (popcount.go:8) MOVL BX,"".~r1+16(FP)
0x003c 00060 (popcount.go:8) RET ,
I know from here that the FUNCDATA
lines contain information for the garbage collector, but other than that I don't see any glaring differences.
What could be causing this large difference in speed between the 2 functions?
It converts the whole code into machine language at a time. But the Assembler can't do this at once. It takes less execution time compared to an assembler. It takes more time than a compiler.
Actually, the short answer is: Assembler is always faster or equal to the speed of C. The reason is that you can have assembly without C, but you can't have C without assembly (in the binary form, which we in the old days called "machine code").
The difference between compiler and assembler is that a compiler is used to convert high-level programming language code into machine language code. On the other hand, an assembler converts assembly level language code into machine language code. Both these terms are relevant in context to program execution.
The execution speed of assembly language is not as fast as machine language. The commands need to be translated to be understood by the processor. The execution speed of a compiler is slower than that of an assembler.
If you look at the pseudo-assembler for the funnction calls you will see that the .s
(assembler) version is called, with call overhead, and the .go
version is inlined.
func S() {
pc := popCount(uint32(0))
_ = pc
}
"".S t=1 size=48 value=0 args=0x0 locals=0x10
0x0000 00000 (popcount.go:11) TEXT "".S+0(SB),$16-0
0x0000 00000 (popcount.go:11) MOVQ (TLS),CX
0x0009 00009 (popcount.go:11) CMPQ SP,(CX)
0x000c 00012 (popcount.go:11) JHI ,21
0x000e 00014 (popcount.go:11) CALL ,runtime.morestack00_noctxt(SB)
0x0013 00019 (popcount.go:11) JMP ,0
0x0015 00021 (popcount.go:11) SUBQ $16,SP
0x0019 00025 (popcount.go:11) FUNCDATA $2,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
0x0019 00025 (popcount.go:11) FUNCDATA $3,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
0x0019 00025 (popcount.go:12) MOVL $0,(SP)
0x0020 00032 (popcount.go:12) PCDATA $1,$0
0x0020 00032 (popcount.go:12) CALL ,"".popCount(SB)
0x0025 00037 (popcount.go:12) MOVL 8(SP),BX
0x0029 00041 (popcount.go:12) NOP ,
0x0029 00041 (popcount.go:14) ADDQ $16,SP
0x002d 00045 (popcount.go:14) RET ,
func S_G() {
pc := popCount_g(uint32(0))
_ = pc
}
"".S_G t=1 size=64 value=0 args=0x0 locals=0x8
0x0000 00000 (popcount.go:16) TEXT "".S_G+0(SB),4,$8-0
0x0000 00000 (popcount.go:16) SUBQ $8,SP
0x0004 00004 (popcount.go:16) FUNCDATA $2,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
0x0004 00004 (popcount.go:16) FUNCDATA $3,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
0x0004 00004 (popcount.go:17) MOVL $0,BP
0x0006 00006 (popcount.go:17) MOVL BP,BX
0x0008 00008 (popcount.go:17) SHRL $1,BX
0x000a 00010 (popcount.go:17) ANDL $1431655765,BX
0x0010 00016 (popcount.go:17) SUBL BX,BP
0x0012 00018 (popcount.go:17) MOVL BP,AX
0x0014 00020 (popcount.go:17) ANDL $858993459,AX
0x0019 00025 (popcount.go:17) SHRL $2,BP
0x001c 00028 (popcount.go:17) ANDL $858993459,BP
0x0022 00034 (popcount.go:17) ADDL BP,AX
0x0024 00036 (popcount.go:17) MOVL AX,BX
0x0026 00038 (popcount.go:17) SHRL $4,BX
0x0029 00041 (popcount.go:17) ADDL AX,BX
0x002b 00043 (popcount.go:17) ANDL $252645135,BX
0x0031 00049 (popcount.go:17) IMULL $16843009,BX
0x0037 00055 (popcount.go:17) SHRL $24,BX
0x003a 00058 (popcount.go:19) ADDQ $8,SP
0x003e 00062 (popcount.go:19) RET ,
If you actually want a popcnt, it's been a native instruction on X86 processors for a while now. Your compiler may have an intrinsic for it, such as Microsoft's __popcnt() (also __popcnt16() and __popcnt64() ), or GCC's __builtin_popcount() (or __builtin_popcountll() for 64 bit).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With