
How can I use os.posix_fadvise to prevent file caching on Linux?

I have a script that generally operates on entire block devices, and if every block that gets read is cached, it will evict data being used by other applications. To prevent this from happening, I added support for using mmap(2) with posix_fadvise(2) with the following logic:

Function for indicating that blocks are no longer needed:

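import os

# Assumption: PAGE_SIZE is defined elsewhere in the original script; the
# system page size is one plausible definition.
PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")


def advise_dont_need(fd, offset, length):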
    """
    Announce that data in a particular location is no longer needed.

    Arguments:
    - fd (int): File descriptor.
    - offset (int): Beginning of the unneeded data.
    - length (int): Length of the unneeded data.
    """
    # TODO: macOS support
    if hasattr(os, "posix_fadvise"):
        # posix_fadvise(2) states that "If the application requires that data
        # be considered for discarding, then offset and len must be
        # page-aligned." When this code aligns the offset and length, the
        # advised area is widened under the presumption it is better to discard
        # more memory than needed than to leak it which could cause resource
        # issues.

        # If the offset is unaligned, extend it toward 0 to align it and adjust
        # the length to compensate for the change.
        aligned_offset = offset - offset % PAGE_SIZE
        length += offset - aligned_offset
        offset = aligned_offset

        # If the length is unaligned, widen it to align it.
        length -= length % -PAGE_SIZE

        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
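
The alignment arithmetic in the function above can be checked in isolation. The following is only a minimal sketch: the helper name align_range is hypothetical (it is not part of the script), and the page size is assumed to be 4096 bytes for illustration.

```python
# Hypothetical helper that mirrors the alignment steps of advise_dont_need.
# PAGE_SIZE is fixed at 4096 here purely for illustration.
PAGE_SIZE = 4096


def align_range(offset, length):
    """Return (offset, length) widened to page boundaries."""
    # Round the offset down to a page boundary and grow the length to match.
    aligned_offset = offset - offset % PAGE_SIZE
    length += offset - aligned_offset
    # In Python, x % -n lies in (-n, 0], so subtracting it rounds x up.
    length -= length % -PAGE_SIZE
    return aligned_offset, length


if __name__ == "__main__":
    print(align_range(5000, 100))
    print(align_range(107303927808, 1048576))
    print(align_range(107304976384, 0))
```

Note that a length of 0 stays 0 after alignment, and posix_fadvise(2) treats a length of 0 as "until the end of the file".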

Logic that reads the file:

            with open(path, "rb", buffering=0) as file, \
              ProgressBar("Reading file") as progress, timer() as read_loop:
                size = file_size(file)

                if mmap_file:
                    # At the time of this writing, mmap.mmap in CPython uses
                    # st_size to determine the size of a file which will not
                    # work with every file type which is why file size
                    # autodetection (size=0) cannot be used here.
                    fd = file.fileno()
                    view = mmap.mmap(fd, size, prot=mmap.PROT_READ)

                try:
                    while writer.error is None and hash_queue.error is None:
                        # Skip offsets that are already in the block map.
                        if offset in blocks:
                            while offset in blocks:
                                if mmap_file:
                                    advise_dont_need(fd, offset, block_size)

                                offset += block_size

                            if not mmap_file:
                                file.seek(offset)

                        if mmap_file:
                            block = view[offset:offset + block_size]
                            advise_dont_need(fd, offset, len(block))
                        else:
                            block = file.read(block_size)

                        if not block:
                            break

                        bytes_read += len(block)

                        while hash_queue.error is None:
                            try:
                                hash_queue.put((offset, block), timeout=0.1)
                                offset += len(block)
                                progress.update(offset / size)
                                break
                            except queue.Full:
                                pass
                finally:
                    if mmap_file:
                        view.close()

When I run the script and monitor the output of free -h, I can see buffer cache usage increases despite this logic. Is my logic incorrect, or is this the result of posix_fadvise(2) being just that -- advice vs. a mandate?

Here are some logs showing the values of the length and offset toward the end of the script's execution with block_size set to 1048576:

offset=107296587776; length=1048576
offset=107297636352; length=1048576
offset=107298684928; length=1048576
offset=107299733504; length=1048576
offset=107300782080; length=1048576
offset=107301830656; length=1048576
offset=107302879232; length=1048576
offset=107303927808; length=1048576
offset=107304976384; length=0
Eric Pruitt asked Jul 31 '21 21:07



1 Answer

It's not entirely accurate that your script will cause the eviction of application data, nor is posix_fadvise interpreted exactly the way you expect. The way the Linux buffer and page caches work is a bit more complicated than that.

First, terminology:

  • Buffer cache - for raw block device access, commonly outside a filesystem. Units are blocks. A good way to test it is dd if=/dev/... of=/dev/null (with if= pointing at a block device). Doing so several times under time(1) should show a considerable time reduction for the second and later runs.

  • Page cache - for filesystem-based access. Units are traditionally full pages, indexed by inode, so only one copy is maintained per file. A good way to test it is cp, cat, or really any access to a large file. Again, running it several times under time(1) should show the time decrease and the page cache usage increase (but not more than once for the same file).
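
The repeated-read test described in the bullets can also be scripted. This is only a rough sketch: the function name time_reads and its parameters are made up for illustration, and the timings it returns depend entirely on the machine and on what is already cached.

```python
import time


def time_reads(path, repeats=3, chunk_size=1024 * 1024):
    """Read `path` fully `repeats` times; return the seconds each pass took."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # buffering=0 avoids Python's own buffering; the kernel caches still apply.
        with open(path, "rb", buffering=0) as f:
            while f.read(chunk_size):
                pass
        timings.append(time.perf_counter() - start)
    return timings
```

If the file was not cached beforehand, the first entry would typically be the largest, with later passes served from the cache.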

Linux will attempt to maximize the usage of both caches. A common way of looking at the usage is via free(1):

[localhost ~]$ free
              total        used        free      shared  buff/cache   available
Mem:        3995408      633820     2241896        5820     1119692     3106196
Swap:       2138108      422408     1715700

The buff/cache column is accounted for separately and does not count as "used", because "used" covers memory held by processes. If processes or applications need the memory, they take precedence and the buff/cache will be purged. You can test that by writing a simple program that calls malloc and memset and watching the cache sizes shrink (to their bare minimum, which is a few megabytes). Older versions of free used to show a "-/+ buffers/cache" line, which was clearer.
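
To watch these numbers from a script instead of rerunning free, one option is to read /proc/meminfo, which is where free gets its figures on Linux. The sketch below is illustrative only: the function names are invented, and the fields shown (Buffers, Cached, MemAvailable) only roughly correspond to the buff/cache and available columns.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (mostly kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only "Name: <number> [unit]" lines.
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def cache_summary(path="/proc/meminfo"):
    """Return the Buffers, Cached and MemAvailable values where present."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return {key: info.get(key) for key in ("Buffers", "Cached", "MemAvailable")}
```

Calling cache_summary() before and after a large read, or while a memory-hungry program runs, should show the values move in the way described above.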

Application memory usage is made up of anonymous memory (the sum of malloc(3) allocations, etc.) and file-mapped memory (mmap(2) with MAP_FILE). The latter counts as file cache memory, though, not as application memory. Such file-mapped memory can be safely evicted as long as it is clean (read-only, or not yet modified). The former (anonymous memory), however, can only be evicted to swap, since there is no backing file for it.

The posix_fadvise(2) call you're using is indeed advice. But if there's enough free memory, your advice will be ineffective: you're saying you won't need the data, but then you actually do read those offsets, so Linux will cache the file data. There's enough memory to satisfy it, and you might end up using it again, so why not cache it? It shouldn't cause any eviction of anonymous memory or notable memory pressure, and it would save time by orders of magnitude if your data were found in the cache (since that would save I/O to disk/flash, which is O(1000+) times slower).

Another way of looking at this: posix_fadvise with POSIX_FADV_DONTNEED is usually meant for a huge file of which you will only access certain portions, so you're telling the system not to cache the ranges you will not be using. As soon as you do use them, the advice is irrelevant.
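
As a rough illustration of that usage pattern, the sketch below advises the kernel about the parts of a file that will be skipped and then reads only the wanted range. The function name read_range_only is hypothetical, and since this is advice, the kernel remains free to handle it as it sees fit.

```python
import os


def read_range_only(path, start, length):
    """Read `length` bytes at `start`, advising that the rest is not needed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # os.posix_fadvise is not available on every platform.
        if hasattr(os, "posix_fadvise"):
            if start > 0:
                os.posix_fadvise(fd, 0, start, os.POSIX_FADV_DONTNEED)
            # A length of 0 means "until the end of the file".
            os.posix_fadvise(fd, start + length, 0, os.POSIX_FADV_DONTNEED)
        return os.pread(fd, length, start)
    finally:
        os.close(fd)
```

The ranges here are not page-aligned; a stricter version would align them the way the question's advise_dont_need does.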

Btw, you can also use madvise(2) directly for the mmap(2)ed region, with MADV_DONTNEED, etc.
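
In Python that could look roughly like the sketch below, which assumes Python 3.8+ (where mmap objects gained a madvise method) and a regular file whose size st_size reports correctly. The function name read_block_and_release is made up for illustration, and MADV_DONTNEED is still only advice.

```python
import mmap
import os


def read_block_and_release(path, offset, block_size):
    """Copy one block out of a read-only mapping, then advise dropping it."""
    with open(path, "rb", buffering=0) as f:
        # Assumption: a regular file, so st_size is usable (unlike the
        # block devices mentioned in the question).
        size = os.fstat(f.fileno()).st_size
        view = mmap.mmap(f.fileno(), size, prot=mmap.PROT_READ)
        try:
            block = view[offset:offset + block_size]  # bytes copy
            can_advise = hasattr(view, "madvise") and hasattr(mmap, "MADV_DONTNEED")
            if block and can_advise:
                # madvise(2) requires a page-aligned start address.
                start = offset - offset % mmap.PAGESIZE
                view.madvise(mmap.MADV_DONTNEED, start, offset + len(block) - start)
            return block
        finally:
            view.close()
```

Because the slice is copied into a bytes object before the advice is given, the returned block stays valid regardless of what the kernel does with the mapped pages.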

Technologeeks answered Oct 21 '22 21:10