 

How to make use of the filesystem cache in Java or Python?

A recent blog post on the Elasticsearch website talks about the features of their new 1.4 beta release.

I am very curious about how they make use of the filesystem cache:

Recent releases have added support for doc values. Essentially, doc values provide the same function as in-memory fielddata, but they are written to disk at index time. The benefit that they provide is that they consume very little heap space. Doc values are read from disk, instead of from memory. While disk access is slow, doc values benefit from the kernel’s filesystem cache. The filesystem cache, unlike the JVM heap, is not constrained by the 32GB limit. By shifting fielddata from the heap to the filesystem cache, you can use smaller heaps which means faster garbage collections and thus more stable nodes.

Before this release, doc values were significantly slower than in-memory fielddata. The changes in this release have improved the performance significantly, making them almost as fast as in-memory fielddata.

Does this mean that we can manipulate the behavior of the filesystem cache instead of passively waiting for the OS to do it? If that is the case, how can we make use of the filesystem cache in normal application development? Say, if I'm writing a Python or Java program, how can I do this?

asked Oct 29 '14 by shihpeng




1 Answer

The file-system cache is an implementation detail of the OS's inner workings that is transparent to the end user; it isn't something that needs adjustment or changes. Lucene already makes use of the file-system cache when it manages index segments. Every time something is indexed into Lucene (via Elasticsearch), those documents are written to segments. The segments are first written to the file-system cache, and only after some time (for example, when the translog - a way of keeping track of documents being indexed - is full) is the content of the cache written to an actual file on disk. But, while the documents to be indexed are still only in the file-system cache, they can already be accessed.
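As a rough illustration of that last point (this is not Lucene code, and the file name is made up), data written to a file can usually be read back immediately, even before it has been flushed to disk, because the kernel serves the read from its page cache:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PageCacheDemo {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("segment-demo.bin"); // hypothetical file name

        // Write some data. No fsync is issued here, so the bytes typically
        // sit in the kernel's page cache before they reach the disk.
        Files.write(file, "indexed document".getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);

        // Reading it back right away still works: the kernel serves the
        // read from the page cache, whether or not the data has been flushed.
        String readBack = Files.readString(file, StandardCharsets.UTF_8);
        System.out.println(readBack);
    }
}
```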

The improvement in the doc values implementation means that doc values can now take advantage of the file-system cache: they are read from disk, placed in the cache, and accessed from there, instead of taking up heap space.

How this file-system cache is accessed is described in this excellent blog post:

In our previous approaches, we were relying on using a syscall to copy the data between the file system cache and our local Java heap. How about directly accessing the file system cache? This is what mmap does!

Basically mmap does the same like handling the Lucene index as a swap file. The mmap() syscall tells the O/S kernel to virtually map our whole index files into the previously described virtual address space, and make them look like RAM available to our Lucene process. We can then access our index file on disk just like it would be a large byte[] array (in Java this is encapsulated by a ByteBuffer interface to make it safe for use by Java code). If we access this virtual address space from the Lucene code we don’t need to do any syscalls, the processor’s MMU and TLB handles all the mapping for us. If the data is only on disk, the MMU will cause an interrupt and the O/S kernel will load the data into file system cache. If it is already in cache, MMU/TLB map it directly to the physical memory in file system cache.

As for the actual means of using mmap in a Java program, I think this is the class and method to do so.
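For illustration, here is a minimal sketch using java.nio's FileChannel.map(), which is the standard JDK way to memory-map a file into a MappedByteBuffer (the file name is hypothetical):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    public static void main(String[] args) throws IOException {
        Path indexFile = Path.of("segment-demo.bin"); // hypothetical file name

        try (FileChannel channel = FileChannel.open(indexFile, StandardOpenOption.READ)) {
            // Map the whole file into the process's virtual address space.
            // Page faults pull the data into the kernel's page cache;
            // nothing is copied onto the JVM heap.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            // The mapped region is accessed like a large byte[] through the
            // ByteBuffer API, just as the quoted blog post describes.
            byte first = buffer.get(0);
            System.out.println("first byte: " + first);
        }
    }
}
```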

answered Oct 29 '22 by Andrei Stefan