We have some really big git repositories, and with these we have observed that remote/server compression is a bottleneck when cloning/pulling. Given how pervasive git has become and that it uses zlib, has this zlib compression been optimized?
An Intel paper details how DEFLATE compression can be sped up by a factor of roughly 4x, although at a lower compression ratio:
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-deflate-compression-paper.pdf
Another paper indicates a speedup of ~1.8x while preserving compression ratios for most compression levels (1-9):
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/zlib-compression-whitepaper-copy.pdf
The latter optimization appears to be available on GitHub: https://github.com/jtkukunas/zlib
zlib seems to be quite old (in this fast-paced industry); the latest release is from April 2013. Have there been any attempts to SIMD-optimize zlib for newer processor generations? Or are there alternatives to using zlib in git?
I do understand that you can specify a compression level in git, which affects speed and compression ratio. However, the papers above suggest that substantial performance improvements can be made to zlib without hurting compression ratios.
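For reference, git's level setting (core.compression / pack.compression) is ultimately the level handed to zlib, so the tradeoff can be sketched directly against zlib's one-shot API. The snippet below is just a rough, self-contained illustration using compress2() on made-up synthetic data; the buffer size and contents are arbitrary assumptions, not anything taken from git itself.

```c
/* Rough sketch: compare zlib speed/ratio at the levels that git's
 * core.compression setting ultimately passes down to zlib.
 * The input is a made-up, mildly compressible buffer. Link with -lz. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

int main(void) {
    const uLong len = 16 * 1024 * 1024;            /* 16 MiB of synthetic input */
    unsigned char *src = malloc(len);
    for (uLong i = 0; i < len; i++)
        src[i] = (unsigned char)((i * 31) ^ (i >> 7));

    uLong bound = compressBound(len);
    unsigned char *dst = malloc(bound);

    int levels[] = { 1, 6, 9 };                    /* fast, default, best */
    for (int k = 0; k < 3; k++) {
        uLongf dlen = bound;
        clock_t t0 = clock();
        if (compress2(dst, &dlen, src, len, levels[k]) != Z_OK)
            return 1;
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("level %d: %lu -> %lu bytes (ratio %.2f) in %.3f s\n",
               levels[k], (unsigned long)len, (unsigned long)dlen,
               (double)len / (double)dlen, secs);
    }
    free(src);
    free(dst);
    return 0;
}
```

On typical data, the jump from level 1 to level 9 usually costs a lot of extra time for a comparatively small ratio gain, which is exactly the knob git already exposes; the papers above claim you can shift that whole curve.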
So to recap, are there any existing git implementations that use a highly optimized zlib or a zlib alternative?
PS: It seems a lot of devs/servers would benefit from this (even greenhouse gas emissions ;)).
There are in fact contributions to zlib's deflate from Intel that have yet to be integrated. You can look at this fork of zlib that has some experimental integrations of the Intel and Cloudflare improvements to compression. You could try compiling git against it to see how it does.
zlib is older than you think. Most of the compression code is relatively unchanged from 20 years ago. The decompression was rewritten about 12 years ago.
I don't know of any git implementations using an optimized zlib or an alternative. I have, however, done some investigation of compression and the tradeoffs between compression levels and speed, and if you are aiming to improve performance significantly, you will generally get better results from a new algorithm designed with speed in mind than from trying to optimize an existing one. LZ4 is a good example of a compression algorithm designed with speed as a priority over compression ratio (see the sketch below).
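To give a feel for what a speed-first design looks like at the API level, here is a minimal sketch against LZ4's one-shot C API (LZ4_compress_default / LZ4_decompress_safe). To be clear, git does not use LZ4; the payload and sizes below are arbitrary placeholders.

```c
/* Minimal sketch of LZ4's one-shot API (link with -llz4).
 * Purely illustrative: git does not use LZ4. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lz4.h>

int main(void) {
    const char *src = "Example payload, example payload, example payload.";
    int src_len = (int)strlen(src) + 1;

    int max_dst = LZ4_compressBound(src_len);
    char *packed = malloc((size_t)max_dst);
    int packed_len = LZ4_compress_default(src, packed, src_len, max_dst);
    if (packed_len <= 0) return 1;                  /* 0 means compression failed */

    char *restored = malloc((size_t)src_len);
    int restored_len = LZ4_decompress_safe(packed, restored, packed_len, src_len);
    if (restored_len < 0) return 1;                 /* negative means corrupt input */

    printf("%d -> %d bytes, round-trip %s\n", src_len, packed_len,
           memcmp(src, restored, (size_t)src_len) == 0 ? "ok" : "MISMATCH");
    free(packed);
    free(restored);
    return 0;
}
```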
The nature of compression algorithms means that they don't tend to parallelize or vectorize (SIMD is really a form of parallelism) very effectively, particularly if they were not designed with that as a goal. Compression by its very nature involves serial data dependencies on a stream.
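To make that concrete, the copy loop at the heart of LZ77-style decompression (which DEFLATE builds on) looks roughly like the simplified sketch below; when the match distance is shorter than the match length, each output byte depends on one written only a few iterations earlier, so the loop cannot simply be replaced by a wide vector copy. This is not real DEFLATE code, just an illustration of the dependency.

```c
/* Simplified LZ77-style back-reference copy (not actual DEFLATE code).
 * When dist < len the source and destination overlap, so each byte
 * written depends on one written only `dist` iterations earlier.
 * That loop-carried dependency is what defeats a naive SIMD copy. */
#include <stdio.h>

static void copy_match(unsigned char *out, size_t pos, size_t dist, size_t len) {
    for (size_t i = 0; i < len; i++)
        out[pos + i] = out[pos + i - dist];   /* must proceed byte by byte when dist < len */
}

int main(void) {
    unsigned char out[16] = "abc";            /* already-decoded prefix */
    /* "distance 1, length 8" means: repeat the previous byte eight times */
    copy_match(out, 3, 1, 8);
    out[11] = '\0';
    printf("%s\n", out);                      /* prints "abccccccccc" */
    return 0;
}
```

A plain memcpy() (or a naive SIMD load/store) over that overlapping region would produce the wrong result, which is why these inner loops resist straightforward vectorization.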
Another thing to consider with compression algorithms is whether to prioritize compression or decompression speed. If your bottleneck is the time it takes the server to compress data, then you want to focus on fast compression; but in situations where you compress once and decompress often (loading game assets or fetching a static web page, for example), you likely want to prioritize decompression speed.
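If you want to see which side dominates for your own data, one crude way is to time both directions with zlib's one-shot helpers, as in the sketch below; again, the synthetic buffer and sizes are arbitrary stand-ins, not a benchmark of git's actual workload.

```c
/* Rough sketch: time zlib compression vs. decompression of the same
 * buffer to see which direction your workload should optimize for.
 * Synthetic data, one-shot helpers, no warm-up: not a real benchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

int main(void) {
    const uLong len = 8 * 1024 * 1024;
    unsigned char *src = malloc(len);
    for (uLong i = 0; i < len; i++)
        src[i] = (unsigned char)(i % 251);          /* repetitive, so compressible */

    uLongf packed_len = compressBound(len);
    unsigned char *packed = malloc(packed_len);

    clock_t t0 = clock();
    if (compress2(packed, &packed_len, src, len, Z_DEFAULT_COMPRESSION) != Z_OK)
        return 1;
    double c_secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    uLongf out_len = len;
    unsigned char *out = malloc(out_len);
    t0 = clock();
    if (uncompress(out, &out_len, packed, packed_len) != Z_OK)
        return 1;
    double d_secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("compress:   %.3f s\ndecompress: %.3f s\n", c_secs, d_secs);
    free(src); free(packed); free(out);
    return 0;
}
```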