I am currently trying to implement a PNG encoder in C++ based on libpng that uses OpenMP to speed up the compression process. The tool is already able to generate PNG files from various image formats. I uploaded the complete source code to pastebin.com so you can see what I have done so far: http://pastebin.com/8wiFzcgV
So far, so good! Now, my problem is to find a way how to parallelize the generation of the IDAT chunks containing the compressed image data. Usually, the libpng function png_write_row gets called in a for-loop with a pointer to the struct that contains all the information about the PNG file and a row pointer with the pixel data of a single image row.
(Line 114-117 in the Pastebin file)
//Loop through image
for (i = 0, rp = info_ptr->row_pointers; i < png_ptr->height; i++, rp++) {
png_write_row(png_ptr, *rp);
}
Libpng then compresses one row after another and fills an internal buffer with the compressed data. As soon as the buffer is full, the compressed data gets flushed in a IDAT chunk to the image file.
My approach was to split the image into multiple parts and let one thread compress row 1 to 10 and another thread 11 to 20 and so on. But as libpng is using an internal buffer it is not as easy as I thought first :) I somehow have to make libpng write the compressed data to a separate buffer for every thread. Afterwards I need a way to concatenate the buffers in the right order so I can write them all together to the output image file.
So, does someone have an idea how I can do this with OpenMP and some tweaking to libpng? Thank you very much!
I finally got it to parallelize the compression process. As mentioned by Cameron in the comment to his answer I had to strip the zlib header from the zstreams to combine them. Stripping the footer was not required as zlib offers an option called Z_SYNC_FLUSH which can be used for all chunks (except the last one which has to be written with Z_FINISH) to write to a byte boundary. So you can simply concatenate the stream outputs afterwards. Eventually, the adler32 checksum has to be calculated over all threads and copied to the end of the combined zstreams.
If you are interested in the result you can find the complete proof of concept at https://github.com/anvio/png-parallel
This is too long for a comment but is not really an answer either--
I'm not sure you can do this without modifying libpng (or writing your own encoder). In any case, it will help if you understand how PNG compression is implemented:
At the high level, the image is a set of rows of pixels (generally 32-bit values representing RGBA tuples).
Each row can independently have a filter applied to it -- the filter's sole purpose is to make the row more "compressible". For example, the "sub" filter makes each pixel's value the difference between it and the one to its left. This delta encoding might seem silly at first glance, but if the colours between adjacent pixels are similar (which tends to be the case) then the resulting values are very small regardless of the actual colours they represent. It's easier to compress such data because it's much more repetitive.
Going down a level, the image data can be seen as a stream of bytes (rows are no longer distinguished from each other). These bytes are compressed, yielding another stream of bytes. The compressed data is arbitrarily broken up into segments (anywhere you want!) written to one IDAT chunk each (along with a little bookkeeping overhead per chunk, including a CRC checksum).
The lowest level brings us to the interesting part, which is the compression step itself. The PNG format uses the zlib compressed data format. zlib itself is just a wrapper (with more bookkeeping, including an Adler-32 checksum) around the real compressed data format, deflate (zip files use this too). deflate supports two compression techniques: Huffman coding (which reduces the number of bits required to represent some byte-string to the optimal number given the frequency that each different byte occurs in the string), and LZ77 encoding (which lets duplicate strings that have already occurred be referenced instead of written to the output twice).
The tricky part about parallelizing deflate compression is that in general, compressing one part of the input stream requires that the previous part also be available in case it needs to be referenced. But, just like PNGs can have multiple IDAT chunks, deflate is broken up into multiple "blocks". Data in one block can reference previously encoded data in another block, but it doesn't have to (of course, it may affect the compression ratio if it doesn't).
So, a general strategy for parallelizing deflate would be to break the input into multiple large sections (so that the compression ratio stays high), compress each section into a series of blocks, then glue the blocks together (this is actually tricky since blocks don't always end on a byte boundary -- but you can put an empty non-compressed block (type 00), which will align to a byte boundary, in-between sections). This isn't trivial, however, and requires control over the very lowest level of compression (creating deflate blocks manually), creating the proper zlib wrapper spanning all the blocks, and stuffing all this into IDAT chunks.
If you want to go with your own implementation, I'd suggest reading my own zlib/deflate implementation (and how I use it) which I expressly created for compressing PNGs (it's written in Haxe for Flash but should be comparatively easy to port to C++). Since Flash is single-threaded, I don't do any parallelization, but I do split the encoding up into virtually independent sections ("virtually" because there's the fractional-byte state preserved between sections) over multiple frames, which amounts to largely the same thing.
Good luck!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With