When developing applications that work with compressed on-disk indexes, or on-disk files where parts of the index or file are accessed repeatedly (for argument's sake, say with something akin to a Zipfian distribution), I wonder when it is sufficient or better to rely on OS-level caching (e.g., memory mapping on, say, a Debian system), and when it is better to implement something at the application layer (e.g., FileChannel buffering, Memcached, or a custom LRU cache in Java code).
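To make the first option concrete, here is a minimal sketch of what I mean by relying on the OS, assuming Java; the file name is hypothetical, and files over 2 GB would need multiple mappings:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of the OS-caching approach ("index.bin" is a placeholder):
// map the file and read from the buffer. Regions that are accessed repeatedly
// stay resident in the kernel's page cache, with no cache code in the
// application itself.
public class MappedIndexReader {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("index.bin"),
                                               StandardOpenOption.READ)) {
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(buf.getInt(0)); // e.g., read an int at offset 0
        }
    }
}
```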
For example, one article (in reference to Solr) argues for leaving memory free for OS-caching:
The OS’s cache is really useful, it decreases significantly the time required to answer a query (even after completely restarting the server!), so always remember to keep some memory free for the OS.
This got me wondering whether my application-level cache, which fills memory with weak maps to LRU Java objects, is doing more harm than good, especially since Java is so greedy in terms of memory overhead. Instead of using that memory to cache a few final result objects, would the space be better used by the OS to cache lots of raw compressed data? On the other hand, an application-layer cache is better for platform independence, since it works no matter what OS the code is running on.
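For reference, the kind of application-level cache I have in mind is roughly the following minimal sketch (my real code also uses weak references, which are omitted here):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A simple LRU cache: a LinkedHashMap in access order that evicts the
// least-recently-used entry once the map grows past a fixed capacity.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // drop the eldest entry when over capacity
    }
}
```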
And so I realised that I had no idea how to go about answering that question in a principled way, other than running a couple of specific benchmarks. Which leads me to ask ...
What general guidelines exist for whether to assign available memory for application-level caching, or to leave that memory available for OS-level caching?
In particular, I'd love to be able to better recognise when coding an application-level cache is a waste of time, or even harmful for performance.
Ultimately the answer is always to measure first, analyze, and then optimize. Run your application under a profiler with and without caching, and see what the differences are. There is simply no substitute for direct observation.
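To make that concrete, here is a crude sketch of such a comparison; all names are placeholders, and for serious numbers a proper harness such as JMH is preferable, since naive timing loops suffer from JIT and warm-up artifacts:

```java
import java.util.Random;
import java.util.function.LongUnaryOperator;

// Crude sketch of the measure-first advice: run the same skewed key
// sequence through an uncached and a cached lookup path and compare times.
public class CacheComparison {
    static long timeMillis(long[] keys, LongUnaryOperator lookup) {
        long sink = 0; // accumulate results so the JIT cannot drop the loop
        long t0 = System.nanoTime();
        for (long k : keys) sink += lookup.applyAsLong(k);
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("(checksum " + sink + ")");
        return ms;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        long[] keys = new long[1_000_000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = (long) (1000 * Math.pow(rnd.nextDouble(), 4)); // skewed, Zipf-like
        }
        // Stubs standing in for the application's real lookup paths.
        LongUnaryOperator uncached = k -> k; // would hit the disk/index
        LongUnaryOperator cached   = k -> k; // would check the cache first

        System.out.println("uncached: " + timeMillis(keys, uncached) + " ms");
        System.out.println("cached:   " + timeMillis(keys, cached) + " ms");
    }
}
```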
Having said that, there is a principled way to think about the problem: consider what a cache hit actually saves you.

So, specific to your situation, you need to ask how much work each cache hit avoids. If a cache hit only spares you a single fread() call, then maybe the cache isn't doing very much for you. But if a cache hit avoids randomly traversing a few hundred blocks of several very large files, then maybe it is saving you a lot of time. It could also save you a lot of space in the OS's disk cache, making that memory available for other OS operations.

It is often very helpful to make these aspects of your application configurable (whether or not to use the cache, how much memory to set aside for caching, etc.) and then play with the settings to see what works best for a given scenario.
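One way to sketch that configurability, with hypothetical property names, is to gate the cache behind a system property so the same build can be measured with and without application-level caching:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a configurable caching layer: system properties decide whether
// lookups go through an in-process LRU cache or straight to the underlying
// reader (e.g., a disk read), so both modes can be compared on one workload.
public class ConfigurableLookup<K, V> {
    private final Function<K, V> reader; // the uncached path
    private final Map<K, V> cache;       // null when caching is disabled

    public ConfigurableLookup(Function<K, V> reader) {
        this.reader = reader;
        if (Boolean.getBoolean("app.cache.enabled")) {
            final int capacity = Integer.getInteger("app.cache.capacity", 10_000);
            // access-order LinkedHashMap evicting its eldest entry: a simple LRU
            this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > capacity;
                }
            };
        } else {
            this.cache = null;
        }
    }

    public V get(K key) {
        return (cache == null) ? reader.apply(key)
                               : cache.computeIfAbsent(key, reader);
    }
}
```

Running with, say, -Dapp.cache.enabled=true -Dapp.cache.capacity=50000 then lets you compare both modes on the same workload without rebuilding.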
One of the most interesting developments these days is the availability of solid-state drives. Throughput on these drives isn't as high as on the better spinning disks, but random access is often outstanding. That definitely changes things.
Again, there is no substitute for profiling your code.