I have a script that generally operates on entire block devices, and if every block that gets read is cached, it will evict data being used by other applications. To prevent this from happening, I added support for using mmap(2) with posix_fadvise(2) with the following logic:
Function for indicating that blocks are no longer needed:
    def advise_dont_need(fd, offset, length):
        """
        Announce that data in a particular location is no longer needed.

        Arguments:
        - fd (int): File descriptor.
        - offset (int): Beginning of the unneeded data.
        - length (int): Length of the unneeded data.
        """
        # TODO: macOS support
        if hasattr(os, "posix_fadvise"):
            # posix_fadvise(2) states that "If the application requires that data
            # be considered for discarding, then offset and len must be
            # page-aligned." When this code aligns the offset and length, the
            # advised area is widened under the presumption it is better to discard
            # more memory than needed than to leak it which could cause resource
            # issues.

            # If the offset is unaligned, extend it toward 0 to align it and adjust
            # the length to compensate for the change.
            aligned_offset = offset - offset % PAGE_SIZE
            length += offset - aligned_offset
            offset = aligned_offset

            # If the length is unaligned, widen it to align it.
            length -= length % -PAGE_SIZE

            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
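To make the alignment arithmetic easier to check in isolation, here is a minimal sketch that pulls it into a pure function. The name `align_range` and the default page size of 4096 are assumptions for illustration only; the real script uses its own `PAGE_SIZE` constant.

```python
def align_range(offset, length, page_size=4096):
    """Widen (offset, length) so both are multiples of page_size.

    Hypothetical helper mirroring the arithmetic in advise_dont_need();
    page_size=4096 is an assumed value, not taken from the script.
    """
    # Move the start down to the previous page boundary and grow the
    # length by the same amount so the end of the range does not move.
    aligned_offset = offset - offset % page_size
    length += offset - aligned_offset

    # In Python, x % -n is in the range (-n, 0], so subtracting it rounds
    # the length up to the next multiple of n (or leaves it unchanged).
    length -= length % -page_size
    return aligned_offset, length


if __name__ == "__main__":
    print(align_range(5000, 3000))
    print(align_range(4096, 8192))
```

An unaligned range such as offset 5000, length 3000 is widened to cover the whole page that contains it, while a range that is already aligned passes through unchanged.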
Logic that reads the file:
    with open(path, "rb", buffering=0) as file, \
            ProgressBar("Reading file") as progress, timer() as read_loop:
        size = file_size(file)
        if mmap_file:
            # At the time of this writing, mmap.mmap in CPython uses
            # st_size to determine the size of a file which will not
            # work with every file type which is why file size
            # autodetection (size=0) cannot be used here.
            fd = file.fileno()
            view = mmap.mmap(fd, size, prot=mmap.PROT_READ)
        try:
            while writer.error is None and hash_queue.error is None:
                # Skip offsets that are already in the block map.
                if offset in blocks:
                    while offset in blocks:
                        if mmap_file:
                            advise_dont_need(fd, offset, block_size)
                        offset += block_size
                    if not mmap_file:
                        file.seek(offset)
                if mmap_file:
                    block = view[offset:offset + block_size]
                    advise_dont_need(fd, offset, len(block))
                else:
                    block = file.read(block_size)
                if not block:
                    break
                bytes_read += len(block)
                while hash_queue.error is None:
                    try:
                        hash_queue.put((offset, block), timeout=0.1)
                        offset += len(block)
                        progress.update(offset / size)
                        break
                    except queue.Full:
                        pass
        finally:
            if mmap_file:
                view.close()
When I run the script and monitor the output of free -h, I can see buffer cache usage increase despite this logic. Is my logic incorrect, or is this the result of posix_fadvise(2) being just that: advice rather than a mandate?
Here are some logs showing the values of the length and offset toward the end of the script's execution with block_size set to 1048576:
offset=107296587776; length=1048576
offset=107297636352; length=1048576
offset=107298684928; length=1048576
offset=107299733504; length=1048576
offset=107300782080; length=1048576
offset=107301830656; length=1048576
offset=107302879232; length=1048576
offset=107303927808; length=1048576
offset=107304976384; length=0
The page cache is the main disk cache used by the Linux kernel. In most cases, the kernel refers to the page cache when reading from or writing to disk. New pages are added to the page cache to satisfy User Mode processes' read requests.
It's not entirely accurate that your script will cause the eviction of other applications' data, and posix_fadvise does not behave quite the way you describe either. The way the Linux buffer and page caches work is a bit more complicated than that.
First, terminology:
Buffer cache - for raw block device access, commonly outside a filesystem. Units are blocks. A good way to test it is to run dd if=/dev/... (on a block device) of=/dev/null. Doing so several times with time(1) should show a considerable time reduction for the second and later runs.
Page cache - for filesystem-based access. Units are traditionally full pages, indexed by inode, so only one copy is maintained per file. A good way to test it is to cp or cat, or really do any access to, a large file. Again, doing so several times with time(1) should show the time decrease and the page cache usage increase (but not more than once for the same file).
Linux will attempt to maximize both caches' usage. A common way of looking at the usage is via "free(1)":
    [localhost ~]$ free
                  total        used        free      shared  buff/cache   available
    Mem:        3995408      633820     2241896        5820     1119692     3106196
    Swap:       2138108      422408     1715700
The buff/cache column here is reported separately and doesn't count as "used", because "used" is for processes. If you do need the memory for processes/apps, that takes precedence and the buff/cache will be purged. You can test that by writing a simple program that calls malloc/memset and watching the cache sizes shrink (to their bare minimum, which is a few megabytes). Other versions of free used to show a "-/+ buffers/cache" line, which was clearer.
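If you would rather watch these numbers from a script than from free(1), a rough sketch is to read /proc/meminfo, which exposes Buffers and Cached as separate fields. The helper names `parse_meminfo` and `cache_summary` are made up for this example, and the `SAMPLE` text is illustrative filler used only when /proc/meminfo is not available, not real measurements.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_summary(fields):
    """Pick out the two cache-related fields discussed above (values in kB)."""
    return {
        "buffers_kb": fields.get("Buffers", 0),
        "cached_kb": fields.get("Cached", 0),
    }


# Illustrative sample only; the numbers are placeholders, not measurements.
SAMPLE = """\
MemTotal:        3995408 kB
MemFree:         2241896 kB
Buffers:          102400 kB
Cached:           917292 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back on non-Linux systems
    print(cache_summary(parse_meminfo(text)))
```

Running this before and after reading a block device or a large file should show which of the two counters grows.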
Application memory usage is made up of anonymous memory (the sum of malloc(3)s etc.) and file-mapped memory (mmap(2) with MAP_FILE). The latter counts as file cache memory, though, not as application memory. Such file-mapped memory can be safely evicted as long as it's clean (read-only, or not yet modified). The former (anonymous), however, can only be evicted to swap, since there is no backing file for it.
The posix_fadvise(2) call you're using is indeed only advice. But if there's enough free memory, your advice will be ineffective: you're saying you won't need the data, but then you actually do read those offsets, so Linux will cache the file data. There's enough memory to satisfy it, and you might end up using it again, so why not cache it? It shouldn't cause any eviction of anonymous memory or notable memory pressure, and it would save time by orders of magnitude if your data were found in the cache, since that would save I/O to disk/flash, which is on the order of 1000+ times slower.
Another way of looking at this: posix_fadvise with DONTNEED is usually meant for a huge file of which you'll only access certain portions, so you're telling the system not to cache the ranges you will not be using. As soon as you do use them, the advice is irrelevant.
Btw, you can also use madvise(2) directly for the mmap(2)ed region, with MADV_DONTNEED, etc.
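In Python that could look roughly like the sketch below, which uses mmap.madvise() (available since Python 3.8 on platforms that provide madvise). The function name `read_and_release` is made up for this example, and the loop only stands in for the real per-block work. It assumes block_size is a multiple of the page size so that every start offset is page-aligned. Keep in mind that this is still advice: it releases the pages from the process's mapping, and whether the kernel also drops them from its cache remains up to the kernel.

```python
import mmap
import os
import tempfile


def read_and_release(path, block_size=1 << 20):
    """Read a file through a read-only mapping, advising after each block.

    Hypothetical helper; returns the number of bytes read.
    """
    total = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return 0  # mmap cannot map an empty file
        with mmap.mmap(f.fileno(), size, prot=mmap.PROT_READ) as view:
            # madvise() and MADV_DONTNEED are not available everywhere.
            can_advise = (hasattr(view, "madvise")
                          and hasattr(mmap, "MADV_DONTNEED"))
            for offset in range(0, size, block_size):
                block = view[offset:offset + block_size]  # copies the bytes
                total += len(block)
                # The start address must be page-aligned for madvise(2).
                if can_advise and offset % mmap.PAGESIZE == 0:
                    view.madvise(mmap.MADV_DONTNEED, offset, len(block))
    return total


if __name__ == "__main__":
    # Small self-contained demo on a temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * (3 * (1 << 20) + 123))
    try:
        print(read_and_release(tmp.name))
    finally:
        os.unlink(tmp.name)
```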