I've been running Kafka on Kubernetes without any major issues for a while now; however, I recently introduced a cluster of Cassandra pods and started having performance problems with Kafka.
Even though Cassandra doesn't use the page cache the way Kafka does, it does write to disk frequently, which presumably affects the kernel's underlying cache.
I understand that Kubernetes pods manage memory resources through cgroups, which can be configured by setting memory requests and limits, but I've noticed that Cassandra's use of the page cache increases the number of page faults in my Kafka pods even when they don't appear to be competing for resources (i.e., there's memory available on their nodes).
In Kafka, more page faults lead to more disk writes, which undermine the benefits of sequential IO and compromise disk performance. If you use something like AWS EBS volumes, this eventually depletes your burst balance and can cause catastrophic failures across your cluster.
My question is: is it possible to isolate page cache resources in Kubernetes, or to somehow tell the kernel that pages owned by my Kafka pods should be kept in the cache longer than those owned by my Cassandra pods?
Kafka relies heavily on the file system for storing and caching messages. All data is written to the page cache in the form of log segment files and flushed to disk later. Most modern Linux systems use free memory for the disk cache, so on a machine with 32 GB of memory Kafka can end up using 25-30 GB of it as page cache.
In Kafka Streams, a record cache is additionally used for internal caching and compacting of output records before they are written from a stateful processor node to its state stores.
Running Kafka on Kubernetes enables organizations to simplify operations such as updates, restarts, and monitoring, which are more or less built into the Kubernetes platform.
Kubernetes tracks two main resource types, CPU and memory, and the scheduler uses their requests and limits to decide where to run your pods (see the Kubernetes documentation on managing resources for containers). If you are running on Google Kubernetes Engine (GKE), the default namespace already has some requests and limits set up for you.
I thought this was an interesting question, so here are some findings from a bit of digging.
Best guess: there is no way to do this with Kubernetes out of the box, but there is enough tooling available that it could be a fruitful area for research and development of a tuning and policy application that could be deployed as a DaemonSet.
Findings:
Applications can use the posix_fadvise() system call to tell the kernel which file-backed pages they still need and which they do not, so that the latter can be reclaimed.
http://man7.org/linux/man-pages/man2/posix_fadvise.2.html
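
As a rough illustration (my own sketch, not something from the man page, and the file path is hypothetical), dropping a file's pages from the cache with posix_fadvise() in C looks like this:

    #define _POSIX_C_SOURCE 200112L  /* for posix_fadvise */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* Hypothetical file whose cached pages we no longer need. */
        const char *path = "/var/lib/cassandra/data/example-sstable.db";

        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* offset 0, len 0 covers the whole file; DONTNEED tells the kernel
           these pages can be reclaimed instead of crowding out other files. */
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        if (err != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        close(fd);
        return EXIT_SUCCESS;
    }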
Applications can also open files with O_DIRECT to try to bypass the page cache when doing IO:
https://lwn.net/Articles/457667/
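
Again as a hedged sketch (the path, block size, and alignment values are assumptions, and the file must live on a filesystem that supports O_DIRECT, e.g. ext4 or xfs rather than tmpfs), direct IO in C looks roughly like:

    #define _GNU_SOURCE           /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* Hypothetical scratch file; O_DIRECT writes bypass the page cache. */
        int fd = open("/var/tmp/odirect-example.dat",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* O_DIRECT requires the buffer, offset and length to be aligned,
           typically to the device's logical block size (512 or 4096 bytes). */
        size_t len = 4096;
        void *buf = NULL;
        if (posix_memalign(&buf, 4096, len) != 0) {
            perror("posix_memalign");
            close(fd);
            return EXIT_FAILURE;
        }
        memset(buf, 'x', len);

        if (write(fd, buf, len) != (ssize_t)len)
            perror("write");

        free(buf);
        close(fd);
        return EXIT_SUCCESS;
    }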
There is some indication that Cassandra already uses fadvise in a way that attempts to optimize for reducing its page cache footprint:
http://grokbase.com/t/cassandra/commits/122qha309v/jira-created-cassandra-3948-sequentialwriter-doesnt-fsync-before-posix-fadvise
There is also some recent (Jan 2017) research from Samsung patching Cassandra and fadvise in the kernel to better utilize multi-stream SSDs:
http://www.samsung.com/us/labs/pdfs/collateral/Multi-stream_Cassandra_Whitepaper_Final.pdf
Kafka is aware of the page cache architecture, though it doesn't appear to use fadvise directly. The knobs the kernel exposes are sufficient for tuning Kafka on a dedicated host:
Support in the kernel for device-specific writeback threads goes way back to the 2.6 days:
https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics
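
For example, the writeback knobs referenced there live under /proc/sys/vm and can be inspected like any other file. A minimal, read-only sketch (adjusting them requires root and affects the whole host, so that part is left out):

    #include <stdio.h>

    /* Print one writeback tunable from /proc/sys/vm. */
    static void print_tunable(const char *path) {
        FILE *f = fopen(path, "r");
        if (!f) {
            perror(path);
            return;
        }
        char buf[64];
        if (fgets(buf, sizeof(buf), f) != NULL)
            printf("%-42s %s", path, buf);
        fclose(f);
    }

    int main(void) {
        /* Percentage of memory that may be dirty before background writeback starts. */
        print_tunable("/proc/sys/vm/dirty_background_ratio");
        /* Percentage of memory at which writers are forced into synchronous writeback. */
        print_tunable("/proc/sys/vm/dirty_ratio");
        /* Age (in centiseconds) after which dirty pages become eligible for writeout. */
        print_tunable("/proc/sys/vm/dirty_expire_centisecs");
        return 0;
    }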
Cgroups v1 and v2 focus on pid-based IO throttling, not file-based cache tuning:
https://andrestc.com/post/cgroups-io/
That said, the old linux-ftools set of utilities gives a simple example of a command-line knob for applying fadvise to specific files:
https://github.com/david415/linux-ftools
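
To give a feel for what a tool like linux-ftools' fincore does under the hood, here is a rough C sketch of my own (not code from that repository) that maps a file and uses mincore() to count how many of its pages are currently resident in the page cache:

    #define _DEFAULT_SOURCE       /* for mincore */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return EXIT_FAILURE;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return EXIT_FAILURE; }
        if (st.st_size == 0) { printf("%s is empty\n", argv[1]); return EXIT_SUCCESS; }

        /* Map the file without read/write access; we only want residency info. */
        void *map = mmap(NULL, (size_t)st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t pages = ((size_t)st.st_size + page - 1) / page;
        unsigned char *vec = malloc(pages);
        if (!vec) { perror("malloc"); return EXIT_FAILURE; }

        /* mincore() fills one byte per page; the low bit means "resident in cache". */
        if (mincore(map, (size_t)st.st_size, vec) < 0) { perror("mincore"); return EXIT_FAILURE; }

        size_t resident = 0;
        for (size_t i = 0; i < pages; i++)
            if (vec[i] & 1)
                resident++;

        printf("%s: %zu of %zu pages resident in page cache\n", argv[1], resident, pages);

        free(vec);
        munmap(map, (size_t)st.st_size);
        close(fd);
        return EXIT_SUCCESS;
    }

Pointing this at a Kafka log segment (a hypothetical path such as /var/lib/kafka/data/topic-0/00000000000000000000.log) would show how much of that segment the kernel is still holding in memory.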
So there's enough there to work with. Given specific Kafka and Cassandra workloads (e.g. read-heavy vs. write-heavy), specific prioritizations (Kafka over Cassandra or vice versa), and specific IO configurations (dedicated vs. shared devices), one could arrive at a specific tuning model, and those models could be generalized into a policy model.