Mapping an input file into memory and then directly parsing data from the mapped memory pages can be a convenient and efficient way to read data from files.
However, this practice also seems fundamentally unsafe unless you can ensure that no other process writes to a mapped file, because even the data in private read-only mappings may change if the underlying file is written to by another process. (POSIX, for example, doesn't specify "whether modifications to the underlying object done after the MAP_PRIVATE mapping is established are visible through the MAP_PRIVATE mapping".)
If you wanted to make your code safe in the presence of external changes to the mapped file, you'd have to access the mapped memory only through volatile pointers and then be extremely careful about how you read and validate the input, which seems impractical for many use cases.
Is this analysis correct? The documentation for memory mapping APIs generally mentions this issue only in passing, if at all, so I wonder whether I'm missing something.
Yes. If one thread changes part of the data in the mapping, then all other threads immediately see that change.
But there are also disadvantages: An I/O error on a memory-mapped file cannot be caught and dealt with by SQLite. Instead, the I/O error causes a signal which, if not caught by the application, results in a program crash.
The principal benefits of memory-mapping are efficiency, faster file access, the ability to share memory between applications, and more efficient coding.
Memory-mapped files are accessed through the operating system's memory manager, so the file is automatically partitioned into a number of pages and accessed as needed. You do not have to handle the memory management yourself.
It is not really a problem.
Yes, another process may modify the file while you have it mapped, and yes, it is possible that you will see the modifications. It is even likely, since almost all operating systems have unified virtual memory systems, so unless one requests unbuffered writes, there's no way of writing without going through the buffer cache, and no way without someone holding a mapping seeing the change.
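To make that concrete, here is a minimal sketch of how one might observe it, assuming a Unix-like system with a unified page cache (Linux, for instance). The function name demo_visibility is made up for illustration, and the second file descriptor merely stands in for "another process".

```python
import mmap
import os
import tempfile


def demo_visibility():
    """Return the first mapped bytes before and after a write made
    through a separate descriptor (standing in for another process)."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello world")
        # Read-only mapping of the whole file (length 0 means "entire file").
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
            before = m[:5]  # slicing copies the bytes out of the mapping
            writer = os.open(path, os.O_WRONLY)
            try:
                # Overwrite the start of the file without touching the mapping.
                os.pwrite(writer, b"HELLO", 0)
            finally:
                os.close(writer)
            after = m[:5]
        return before, after
    finally:
        os.close(fd)
        os.unlink(path)
```

On such a system I would expect the second slice to reflect the overwrite even though the mapping itself was never written to. Other platforms, and private mappings in particular, may behave differently.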
That isn't even a bad thing. Actually, it would be more disturbing if you couldn't see the changes. Since the file effectively becomes part of your address space when you map it, it makes perfect sense that you see changes to the file.
If you use conventional I/O (such as read), someone can still modify the file while you are reading it. Worded differently, copying file content to a memory buffer is not always safe in the presence of modifications. It is "safe" insofar as read will not crash, but it does not guarantee that your data is consistent.
Unless you use readv, you have no guarantees about atomicity whatsoever (and even with readv you have no guarantee that what you have in memory is consistent with what is on disk or that it doesn't change between two calls to readv). Someone might modify the file between two read operations, or even while you are in the middle of one.
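If you want at least a rough check that a plain read was not disturbed, one heuristic is to compare the file's metadata before and after reading. The sketch below assumes that comparing size and modification time is good enough for your purposes. It is not a guarantee, since a write that leaves both unchanged at the timestamp's granularity would slip through. The name read_stable and the retry count are invented for illustration.

```python
import os


def read_stable(path, max_attempts=3):
    """Read the whole file, retrying if its size or mtime changed
    while it was being read. Heuristic only, not an atomicity guarantee."""
    for _ in range(max_attempts):
        with open(path, "rb") as f:
            st_before = os.fstat(f.fileno())
            data = f.read()
            st_after = os.fstat(f.fileno())
        unchanged = (
            st_before.st_mtime_ns == st_after.st_mtime_ns
            and st_before.st_size == st_after.st_size
            and len(data) == st_after.st_size
        )
        if unchanged:
            return data
    raise RuntimeError("file kept changing while it was being read")
```

The same idea carries over to a mapping: copy the bytes out once, check the metadata, and parse only the copy.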
This isn't just something that isn't formally guaranteed but "probably still works" -- on the contrary, e.g. under Linux writes are demonstrably not atomic. Not even by accident.
The good news:
Usually, processes don't just open an arbitrary random file and start writing to it. When such a thing happens, it is usually either a well-known file that belongs to the process (e.g. log file), or a file that you explicitly told the process to write to (e.g. saving in a text editor), or the process creates a new file (e.g. compiler creating an object file), or the process merely appends to an existing file (e.g. db journals, and of course, log files). Or, a process might atomically replace a file with another one (or unlink it).
In every case, the whole scary problem boils down to "no issue" because either you are well aware of what will happen (so it's your responsibility), or it works seamlessly without interfering.
If you really don't like the possibility that another process could write to your file while you have it mapped, you can simply omit FILE_SHARE_WRITE under Windows when you create the file handle. POSIX makes it somewhat more complicated, since you need to fcntl the descriptor for a mandatory lock, which isn't necessarily supported or 100% reliable on every system (for example, under Linux).
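Where mandatory locks are not available, a weaker alternative is advisory locking. Note that this is a different technique from the mandatory lock described above: it only helps if every writer cooperates by taking the lock too. A minimal sketch using flock on a Unix-like system follows; the function name read_with_shared_lock is hypothetical.

```python
import fcntl
import mmap


def read_with_shared_lock(path):
    """Copy a file's contents out of a mapping while holding a shared
    advisory lock. Only writers that also call flock are kept out."""
    with open(path, "rb") as f:
        # Blocks while a cooperating writer holds an exclusive lock (LOCK_EX).
        fcntl.flock(f.fileno(), fcntl.LOCK_SH)
        try:
            # Length 0 maps the whole file; an empty file raises ValueError.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                return m[:]  # copy the bytes out before releasing the lock
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)
```

A process that ignores the lock and simply writes to the file is not stopped by this, which is exactly the limitation of advisory locks.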
In theory, you're probably in real trouble if someone does modify the file while you're reading it. In practice, you're reading characters and nothing else: no pointers, or anything which could get you into trouble. Formally, I think it's still undefined behavior, but it's one which I don't think you have to worry about. Unless the modifications are very minor, you'll get a lot of compiler errors, but that's about the end of it.
The one case which might cause problems is if the file was shortened. I'm not sure what happens then, when you're reading beyond the end.
And finally: the system isn't arbitrarily going to open and modify the file. It's a source file; it will be some idiot programmer who does it, and he deserves what he gets. In no case will your undefined behavior corrupt the system or other people's files.
Note too that most editors work on a private copy; when they write back, they do so by renaming the original and creating a new file. Under Unix, once you've opened the file to mmap it, all that counts is the inode number. And when the editor renames or deletes the file, you still keep your copy. The modified file will get a new inode. The only thing you have to worry about is if someone opens the file for update, and then goes around modifying it. Not many programs do this on text files, except for appending additional data to the end.
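The inode behaviour is easy to check for yourself. The sketch below assumes a Unix-like system and uses one common variant of the save pattern (write a new file, then rename it over the old name); the function name demo_replace and the file names are made up for illustration.

```python
import mmap
import os
import tempfile


def demo_replace():
    """Map a file, replace it by rename, and return what the old mapping
    and a fresh open of the same name each see."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "source.txt")
        tmp = os.path.join(directory, "source.txt.tmp")
        with open(path, "wb") as f:
            f.write(b"old contents")
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Save the way many editors do: new file, then rename over the name.
            with open(tmp, "wb") as new:
                new.write(b"new contents")
            os.replace(tmp, path)
            mapped = m[:]  # the mapping is still backed by the original inode
        with open(path, "rb") as f:
            reopened = f.read()  # the name now refers to the new inode
        return mapped, reopened
```

The mapping keeps showing the old bytes, while reopening the path gives the new ones, so a reader that mapped the file before the save is unaffected by it.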
So while formally there's some risk, I don't think you have to worry about it. (If you're really paranoid, you could turn off write authorisation while you're mmaped. And if there's really an enemy agent out to get you, he can turn it right back on.)