 

How to manage page cache resources when running Kafka in Kubernetes

I've been running Kafka on Kubernetes without any major issues for a while now; however, I recently introduced a cluster of Cassandra pods and started having performance problems with Kafka.

Even though Cassandra doesn't use the page cache the way Kafka does, it does make frequent writes to disk, which presumably affects the kernel's underlying cache.

I understand that Kubernetes pods manage memory resources through cgroups, which can be configured by setting memory requests and limits. Even so, I've noticed that Cassandra's use of the page cache can increase the number of page faults in my Kafka pods, even when the pods don't seem to be competing for resources (i.e., there's memory available on their nodes).
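For context, this is the kind of spec I mean; a minimal sketch of a container's memory settings (the image name and values are just placeholders):

    containers:
    - name: kafka
      image: my-registry/kafka:latest   # placeholder image
      resources:
        requests:
          memory: "4Gi"   # what the scheduler reserves on the node
        limits:
          memory: "8Gi"   # the cgroup ceiling; page cache pages a container
                          # touches are charged against this limit too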

In Kafka, more page faults mean more I/O hitting the disk directly, which undermines the benefits of sequential I/O and hurts disk performance. If you use something like AWS's EBS volumes, this will eventually deplete your burst balance and cause catastrophic failures across your cluster.

My question is: is it possible to isolate page cache resources in Kubernetes, or to somehow let the kernel know that pages owned by my Kafka pods should be kept in the cache longer than those owned by my Cassandra pods?

asked Feb 04 '18 by kellanburket

1 Answer

I thought this was an interesting question, so here are some findings from a bit of digging.

Best guess: there is no way to do this with Kubernetes out of the box, but there is enough tooling available that this could be a fruitful area for research and development of a tuning and policy application, deployable as a DaemonSet.

Findings:

Applications can use the posix_fadvise() system call to give the kernel guidance about which file-backed pages the application still needs and which it does not, so the latter can be reclaimed.

http://man7.org/linux/man-pages/man2/posix_fadvise.2.html
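For illustration, here is a minimal C sketch of that call (the file path is a placeholder); an offset and length of 0 cover the whole file:

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/data/cold-segment.log", O_RDONLY);  /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        /* Mark the whole file's cached pages as not needed soon,
           making them cheap eviction candidates for the kernel. */
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        if (err != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        close(fd);
        return 0;
    }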

Applications can also open files with O_DIRECT to try to bypass the page cache entirely when doing IO:

https://lwn.net/Articles/457667/
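A minimal sketch of that approach (placeholder path; note that O_DIRECT requires aligned buffers, typically aligned to the device's logical block size):

    #define _GNU_SOURCE            /* O_DIRECT is a Linux extension */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/data/some-file", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf = NULL;
        if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

        ssize_t n = read(fd, buf, 4096);   /* transfers straight to/from the device */
        if (n < 0) perror("read");

        free(buf);
        close(fd);
        return 0;
    }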

There is some indication that Cassandra already uses fadvise in a way that attempts to minimize its page cache footprint:

http://grokbase.com/t/cassandra/commits/122qha309v/jira-created-cassandra-3948-sequentialwriter-doesnt-fsync-before-posix-fadvise

There is also some recent (Jan 2017) research from Samsung that patches Cassandra and the kernel's fadvise implementation to better utilize multi-stream SSDs:

http://www.samsung.com/us/labs/pdfs/collateral/Multi-stream_Cassandra_Whitepaper_Final.pdf

Kafka is page-cache aware, though it doesn't appear to use fadvise directly. The knobs available from the kernel are sufficient for tuning Kafka on a dedicated host (example settings follow the list):

  • vm.dirty* for guidance on when to get written-to (dirty) pages back onto disk
  • vm.vfs_cache_pressure for guidance on how aggressively to reclaim the VFS (dentry and inode) caches relative to the page cache
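For example (illustrative values only, not recommendations; tune for your workload):

    sysctl -w vm.dirty_background_ratio=5   # start background writeback once 5% of RAM is dirty
    sysctl -w vm.dirty_ratio=15             # block writers once 15% of RAM is dirty
    sysctl -w vm.vfs_cache_pressure=50      # reclaim dentry/inode caches less aggressively (default 100)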

Support in the kernel for device-specific writeback threads goes way back to the 2.6 days:

https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics

Cgroups v1 and v2 focus on process-based IO throttling, not file-based cache tuning:

https://andrestc.com/post/cgroups-io/
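For example, cgroup v2's io.max can cap a group's throughput per device, but there is no analogous knob for reserving a share of the page cache (the cgroup name and device numbers below are placeholders):

    # Throttle writes for the "cassandra" cgroup to 10 MB/s on device 8:16.
    echo "8:16 wbps=10485760" > /sys/fs/cgroup/cassandra/io.max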

That said, the old linux-ftools set of utilities is a simple example of command-line knobs for applying fadvise to specific files:

https://github.com/david415/linux-ftools
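Roughly, usage looks like the following; I'm reproducing the syntax from memory of the README, so treat it as approximate and check the repo:

    # Summarize which of Kafka's log segments are resident in the page cache:
    fincore --pages=false --summarize --only-cached /var/lib/kafka/data/*/*.log
    # Ask the kernel to drop cached pages for a Cassandra data file:
    fadvise /var/lib/cassandra/data/ks/tbl/big-Data.db POSIX_FADV_DONTNEED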

So there's enough there. Given specific Kafka and Cassandra workloads (e.g. read-heavy vs write-heavy), specific prioritizations (Kafka over Cassandra or vice versa), and specific IO configurations (dedicated vs shared devices), one could arrive at a specific tuning model, and those models could be generalized into a policy model.
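If someone were to build that, the deployment shape might look roughly like the DaemonSet sketch below (the image name and host paths are hypothetical):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: page-cache-tuner                # hypothetical tuning agent
    spec:
      selector:
        matchLabels:
          app: page-cache-tuner
      template:
        metadata:
          labels:
            app: page-cache-tuner
        spec:
          containers:
          - name: tuner
            image: example.org/page-cache-tuner:0.1   # hypothetical image
            securityContext:
              privileged: true              # needed to set sysctls and fadvise host files
            volumeMounts:
            - name: broker-data
              mountPath: /host/data
          volumes:
          - name: broker-data
            hostPath:
              path: /var/lib                # wherever the Kafka/Cassandra volumes live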

answered Oct 13 '22 by Jonah Benton