I'm working with a bunch of numpy arrays that don't all fit in RAM, so I need to periodically save them to and load them from the disk.
Usually, I know which ones I'll need to read ahead of time, so I'd like to hide the latency by issuing something like a "prefetch" instruction in advance.
How should I do this?
(There is a similar question about TensorFlow; however, I am not using TensorFlow, so I don't want to create a dependency on it.)
If you're using Python 3.3+ on a UNIX-like system, you can use os.posix_fadvise with the POSIX_FADV_WILLNEED flag to ask the kernel to start reading a file into the page cache right after opening it. For example:
    import os
    import pickle

    with open(filepath, 'rb') as f:
        # Advise the kernel that the whole file will be needed soon.
        os.posix_fadvise(f.fileno(), 0, os.stat(f.fileno()).st_size,
                         os.POSIX_FADV_WILLNEED)
        # ... do other stuff ...
        # If you're lucky, the OS has asynchronously prefetched the
        # file contents into the page cache by the time you need them.
        stuff = pickle.load(f)
Aside from that, Python doesn't directly offer any other APIs for explicit prefetching, but you could use ctypes to call an OS-appropriate prefetch function yourself, or run a background thread that does nothing but read and discard blocks from the file, to improve the odds that the data is already in the system cache when you need it.
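The background-thread approach might look something like the sketch below (the `prefetch` helper and its `blocksize` parameter are names I've made up for illustration). It simply reads the file sequentially and throws the bytes away, so the kernel pulls the data into the page cache while your main thread does other work:

```python
import threading


def prefetch(filepath, blocksize=1 << 20):
    """Warm the OS page cache by reading the file in a background thread.

    Illustrative sketch: reads and discards blocks so a later real read
    is (hopefully) served from cache. Best-effort only.
    """
    def _warm():
        try:
            with open(filepath, 'rb') as f:
                while f.read(blocksize):
                    pass  # discard; we only want the cache side effect
        except OSError:
            pass  # prefetching is opportunistic; ignore I/O errors

    t = threading.Thread(target=_warm, daemon=True)
    t.start()
    return t  # call t.join() if you ever need to wait for it
```

You'd call `prefetch(path)` well before the real load, do other work, and then open and deserialize the file as usual; whether this actually hides latency depends on how much free memory the OS has for caching.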