I have a speed-critical program that repeatedly reads images from disk and computes values from them. There are too many images to keep them all in memory.
The same set of images will be read, we will not change/edit them, and their order is fixed.
Not all the images have the same size, but each is roughly 1 MB when encoded as PNG. There are tens of thousands of them, and most of the RAM is already used to store the computed values.
Other than buying faster disks, or using RAID, what is the fastest way to read a sequence of images?
Would it be faster to have them all inside a large tar file (and read them with custom untarring code), instead of as individual files in a folder?
I could not find a multi-threaded implementation of PNG decoding, so this stage may also become a bottleneck. Would using WebP instead of PNG provide an additional speed edge?
Which other ideas should I consider/evaluate?
Dear Stack Overflow community,
as promised, here are the results of the experiments done based on your many suggestions. A special thanks to @user894763 who put me on the "right path".
tl;dr: use pnm files inside an uncompressed tar (yes, I said pnm!).
I ran the experiments on two high-end machines, one with SSD disks and the other using a networked file system. Both have high-end CPUs but sit at "two sides of the spectrum" for disk access. Surprisingly, the conclusions are the same for both machines, so I report only one set of results (for the latter case); the ratios between file formats are almost identical in both experiments.
From these experiments I have learned a few important things; the experimental setup and numbers come first, the lessons at the end.
I am reading a set of ~1200 images in a fixed sequence; no computation is done on the images, I am simply measuring the time to load the pixels into memory. The tar file sizes are ~600 MB in pnm format, ~300 MB in png format, and ~200 MB in webp format.
"Fresh read" means first read done on the machine.
"Cached read" means the second read done on the same machine (and any subsequent one).
All numbers are roughly +- 10 Hz.
webp fresh read: 30 Hz
webp cached read: 80 Hz
webp + tar fresh read: 100 Hz
webp + tar cached read: 100 Hz
png fresh read: 50 Hz
png cached read: 165 Hz
png + tar fresh read: 200 Hz
png + tar cached read: 200 Hz
pnm fresh read: 50 Hz
pnm cached read: 600 Hz
pnm + tar fresh read: 200 Hz
pnm + tar cached read: 2300 Hz
I was told that maybe there is a way to change the webp compression parameters to make the decompression faster. I suspect that it would still not match the pnm performance.
Please note that I used custom code to read the images in the tar file; the file is read from disk "image by image".
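For reference, this kind of "image by image" reading from an uncompressed tar only takes a few lines; the sketch below is a minimal Python illustration (not the exact code used here), with decode standing in for whatever PNG/WebP/PNM decoder you plug in.

import tarfile

def read_images_from_tar(tar_path, decode):
    # Yield (name, image) pairs in the order the members appear in the tar.
    # decode() is a placeholder: it gets the raw bytes of one member and
    # returns whatever your pipeline needs (pixels, a numpy array, ...).
    # Mode "r|" streams the archive sequentially instead of seeking around.
    with tarfile.open(tar_path, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            yield member.name, decode(data)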
I do not know why reading the webp images "fresh" was slower than the png ones; I can only speculate that the networked disk system had some "internal" cache that somewhat changed the behaviour. However, this does not affect the lessons.
If you read a file (or a set of files) multiple times, the operating system disk cache will make all future reads essentially "as fast as reading from RAM".
Even when reading from disk, the time to decompress images is non-negligible.
Putting all the files into a single uncompressed (tar) file makes things significantly faster, because the operating system assumes the whole file will be read and pre-loads future images even before we access them. This does not seem to happen when simply reading individual files from a folder (a sketch of requesting this read-ahead explicitly follows these lessons).
With proper care, a 4x to 10x speed-up can be obtained when reading a sequence of images from disk (especially if it is read repeatedly).
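If you do not want to rely on the OS heuristics alone, on POSIX systems you can hint the kernel explicitly that the tar will be read sequentially. A minimal sketch, assuming Linux/Unix and Python's os.posix_fadvise; the file name and chunk size are just examples:

import os

def open_sequential(path):
    # Open read-only and advise the kernel that the whole file will be
    # read sequentially, so it can read ahead aggressively (POSIX only).
    fd = os.open(path, os.O_RDONLY)
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)  # length 0 = whole file
    return os.fdopen(fd, "rb")

with open_sequential("images.tar") as f:
    while chunk := f.read(8 << 20):  # 8 MB chunks
        pass  # hand the bytes to the tar/image parsing code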
PNG is not built for speed. It's slower than jpeg and no smaller than tif. If you're stuck with PNG, no other optimisations will make any difference.
For example:
$ time vips avg wtc.tif
117.853995
real 0m0.525s
user 0m0.756s
sys 0m0.580s
$ time vips avg wtc.png
117.853995
real 0m3.622s
user 0m3.984s
sys 0m0.584s
where "wtc" is a 10,000 x 10,000 pixel RGB photo, the tif is uncompressed strip format and the png is also uncompressed, both images were in disc cache, and "avg" finds and prints the average pixel value.
vips has its own ".v" format, which is simply a huge buffer of pixels. This format can be read in parallel with mmap() and is a bit quicker again:
$ time vips avg wtc.v
117.853995
real 0m0.162s
user 0m0.460s
sys 0m0.092s
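The same approach works for any raw pixel dump: memory-map the file and wrap the mapping in an array without copying. A hedged sketch; it assumes a plain 8-bit RGB buffer with a known fixed-size header and does not reproduce the actual ".v" header layout:

import mmap
import numpy as np

def map_raw_rgb(path, width, height, header_bytes=0):
    # Memory-map a raw 8-bit RGB pixel dump and view it as a numpy array.
    # header_bytes is whatever fixed-size header your format has; no data
    # is copied, pages are faulted in on access (read-only mapping).
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    pixels = np.frombuffer(mm, dtype=np.uint8,
                           count=width * height * 3,
                           offset=header_bytes)
    return pixels.reshape(height, width, 3)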
If your images can be compressed, the tradeoffs shift a bit. For example, jpeg will typically compress 10x, so decode speed becomes much more important than disc speed. You'd want to use an optimised decode library like libturbojpeg and process several files at once (a sketch of the parallel part follows the timing below).
$ time vips avg wtc.jpg
117.853995
real 0m1.413s
user 0m1.696s
sys 0m0.564s
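A hedged sketch of the "process several files at once" part: a process pool sidesteps Python's GIL and any decoder that is not thread-safe. Pillow is used here only as a stand-in decoder (on many systems it is built against libjpeg-turbo, but that depends on your build):

from concurrent.futures import ProcessPoolExecutor
import numpy as np
from PIL import Image  # stand-in decoder; swap in your own bindings

def decode_one(path):
    # Decode a single image file to an RGB numpy array.
    with Image.open(path) as im:
        return np.asarray(im.convert("RGB"))

def decode_many(paths, workers=4):
    # Decode several images concurrently, preserving input order.
    # Note: results are pickled back to the parent process, which costs
    # a bit for large images; batching can reduce that overhead.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_one, paths))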
PNG uses libz and for photographic images won't get more than about 2x compression. Even at the same compression levels it's quite a lot slower than tif with deflate:
$ time vips avg wtc.tif
117.853995
real 0m3.154s
user 0m3.496s
sys 0m0.540s
$ time vips avg wtc.png
117.853995
real 0m4.888s
user 0m5.196s
sys 0m0.556s
$ ls -l wtc.*
-rw-r--r-- 1 john john 15150881 Feb 20 2012 wtc.jpg
-rw-rw-r-- 1 john john 135803013 May 18 12:47 wtc.png
-rw-rw-r-- 1 john john 143807446 May 18 12:53 wtc.tif
-rw-rw-r-- 1 john john 263509369 May 18 12:37 wtc.v
I suppose the other factor is how time-consuming your processing is. If you're doing something intensive, read speed and decode speed will not be important.
You should reverse the order of reading. That is, in the first pass read from image 1 to image N, in the second pass read from image N back to image 1, in the third pass read from image 1 to image N again, and so on. That way you'll hit the disk cache more often.
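A tiny sketch of that alternating order, so the tail of one pass is still in the page cache when the next pass starts:

def passes(paths, n_passes):
    # Yield the file list for each pass, reversing direction every pass.
    for i in range(n_passes):
        yield list(paths) if i % 2 == 0 else list(reversed(paths))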
Processing (or at least loading) several images at once, in different threads, might benefit the overall throughput too, because the OS will then be able to optimize the disk seeks.
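For the loading part, plain threads are enough in Python because blocking file reads release the GIL; a sketch that keeps a few reads in flight ahead of the consumer (the prefetch depth is an arbitrary example):

from collections import deque
from concurrent.futures import ThreadPoolExecutor

def read_bytes(path):
    with open(path, "rb") as f:
        return f.read()

def prefetching_reader(paths, depth=4):
    # Yield (path, bytes) in order, keeping up to `depth` reads in flight
    # so the OS sees several outstanding requests and can schedule seeks.
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        it = iter(paths)
        for path in it:  # prime the pipeline
            pending.append((path, pool.submit(read_bytes, path)))
            if len(pending) >= depth:
                break
        while pending:
            path, fut = pending.popleft()
            yield path, fut.result()
            nxt = next(it, None)
            if nxt is not None:
                pending.append((nxt, pool.submit(read_bytes, nxt)))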
If the OS has good support for AIO, then it might be beneficial as well.
Putting images into a single file might indeed help to minimize the seeks (it depends on the filesystem's defragmentation policies, though). In that case you should use an archive format with fast access to a single file, so that the files can be read in reverse order, e.g. "zip" with no compression.
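A sketch of the uncompressed-zip idea: zipfile keeps a central directory, so a single member can be read directly without scanning the whole archive (file names here are only examples):

import zipfile

def build_stored_zip(zip_path, image_paths):
    # Pack images into a zip with no compression, so members can be read
    # back byte-for-byte with cheap random access.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for p in image_paths:
            zf.write(p)

def read_member(zf, member_name):
    # Read one image's raw bytes directly; keep the ZipFile open and
    # reuse it rather than reopening the archive for every image.
    return zf.read(member_name)

with zipfile.ZipFile("images.zip", "r") as zf:
    data = read_member(zf, "img_000123.pnm")  # example member name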
With memory mapping there should be an option to ask the OS to prefetch a portion of the memory-mapped file (e.g. MAP_POPULATE). Reading large portions of the archive that way might be faster than reading it block by block.
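A sketch of prefetching a region of a memory-mapped archive: madvise(MADV_WILLNEED) asks the kernel to start paging that range in ahead of use (Linux/Unix, Python 3.8+), while the MAP_POPULATE flag mentioned above pre-faults the whole mapping at mmap time and is the heavier option:

import mmap

def map_archive(path):
    # Memory-map a whole archive read-only.
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def prefetch(mm, offset, length):
    # Ask the kernel to start reading this byte range in the background.
    # madvise requires a page-aligned start, so round the offset down.
    page = mmap.PAGESIZE
    start = (offset // page) * page
    mm.madvise(mmap.MADV_WILLNEED, start, length + (offset - start))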