Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to properly use carry-less multiplication assembly (PCLMULQDQ) in zlib CRC32?

I've recently been playing around with CloudFlare's optimized zlib, and the results are really quite impressive.

Unfortunately, they seem to have assumed development of zlib was abandoned, and their fork broke away. I was eventually able to manually rebase their changes on to the current zlib development branch, though it was a real pain in the ass.

Anyway, there's still one major optimization in the CloudFlare code I haven't been able to utilize, namely, the fast CRC32 code implemented with the PCLMULQDQ carry-less multiplication instructions included with newer (Haswell and later, I believe) Intel processors, because:

  1. I'm on a Mac, and neither the clang integrated assembler nor Apple's ancient GAS understand the newer GAS mnemonics used, and

  2. The code was lifted from the Linux kernel and is GPL2, which makes the whole library GPL2, and thereby basically renders it useless for my purposes.

So I did some hunting around, and after a few hours I stumbled onto some code that Apple is using in their bzip2: handwritten, vectorized CRC32 implementations for both arm64 and x86_64.

Bizarrely, the comments for the x86_64 assembly are (only) in the arm64 source, but it does seem to indicate that this code could be used with zlib:

This function SHOULD NOT be called directly. It should be called in a wrapper
function (such as crc32_little in crc32.c) that 1st align an input buffer to 16-byte (update crc along the way),
and make sure that len is at least 16 and SHOULD be a multiple of 16.

But I unfortunately, after a few attempts, at this point I seem to be in a bit over my head. And I'm not sure how to actually do that. So I was hoping someone could show me how/where one would call the function provided.

(It also would be fantastic if there were a way to do it where the necessary features were detected at runtime, and could fall back to the software implementation if the hardware features are unavailable, so I wouldn't have to distribute multiple binaries. But, at the very least, if anyone could help me suss out how to get the library to correctly use the Apple PCLMULQDQ-based CRC32, that would go a long way, regardless.)

like image 290
Geoff Nixon Avatar asked May 22 '16 11:05

Geoff Nixon


1 Answers

As it says, you need to calculate the CRC sum on a 16-byte aligned buffer that has length of multiple of 16 bytes. Thus you'd cast the current buffer pointer as uintptr_t and for as long as its 4 LSB bits are not zero, you increase the pointer feeding the bytes into an ordinary CRC-32 routine. Once you've at 16-byte aligned address, you round the remaining length down to multiple of 16, then feed these bytes to the fast CRC-32, and again the remaining bytes to the slow calculation.


Something like:

// a function for adding a single byte to crc
uint32_t crc32_by_byte(uint32_t crc, uint8_t byte);

// the assembly routine
uint32_t _crc32_vec(uint32_t crc, uint8_t *input, int length);

uint32_t crc = initial_value;
uint8_t *input = whatever;
int length = whatever; // yes, the assembly uses *int* length.

assert(length >= 32); // if length is less than 32 just calculate byte by byte
while ((uintptr_t)input & 0xf) { // for as long as input is not 16-byte aligned
    crc = crc32_by_byte(crc, *input++);
    length--;
}

// input is now 16-byte-aligned
// floor length to multiple of 16
int fast_length = (length >> 4) << 4;
crc = _crc32_vec(crc, input, fast_length);

// do the remaining bytes
length -= fast_length;
while (length--) {
    crc = crc32_by_byte(crc, *input++)
}
return crc;