When mmap()ing a text file, like so
    struct stat sb;
    int fd = open("file.txt", O_RDWR);
    fstat(fd, &sb);
    char *text = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
the file contents are mapped into memory directly, and text will not contain a NUL terminator, so operating on it with normal string functions would not be safe. On Linux (at least) the remaining bytes of the unused page are zero-filled, so effectively you get a NUL terminator in all cases where the file size isn't a multiple of the page size.
But relying on that feels dirty, and other mmap() implementations (e.g., in FreeBSD, I think) don't zero-fill partial pages. Mappings of files whose size is a multiple of the page size will also lack the NUL terminator.
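To make the page-boundary case concrete, here is a minimal sketch of the check. The helper names has_page_slack and system_page_size are made up for illustration; sysconf(_SC_PAGESIZE) is the POSIX call for querying the page size. The check only says whether any bytes follow the file in its last page. Whether those bytes are zero is the platform question raised above, and this code does not answer it.

```cpp
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper: the page size reported by the system.
std::size_t system_page_size() {
    return static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
}

// Hypothetical helper: true when the file does not end exactly on a page
// boundary, so at least one byte of the last mapped page lies past the
// end of the file. An empty file has no mapped page and yields false.
bool has_page_slack(std::size_t file_size, std::size_t page_size) {
    assert(page_size > 0);
    return file_size % page_size != 0;
}
```

With a 4096-byte page, a 4097-byte file has slack after its last byte, while a 4096-byte or 8192-byte file has none.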
Are there reasonable ways to work around this or to add the NUL terminator?
Things I've considered:
1. Use strn*() functions exclusively and track the distance to the end of the buffer.
   - Con: some str*() functions don't have a strn*() counterpart, like strstr.
2. A workaround based on MAP_FIXED.
   - Pro: can use the str*() functions.
   - Con: MAP_FIXED is not thread-safe; seems like an awful hack anyway.
3. mmap() an extra byte, make the map writeable, and write the NUL terminator. The OpenGroup's mmap man page says you can make a mapping larger than your object's size but that accessing data outside of the actual mapped object will generate a SIGBUS.
   - Pro: can use the str*() functions.
   - Con: has to deal with SIGBUS, which could potentially mean something else happened. I'm not actually sure writing the NUL terminator will work?
4. Extend the file with ftruncate() by one byte.
   - Pro: can use the str*() functions; ftruncate() will write a NUL byte to the newly allocated area for you.
   - Con: doesn't help with mmap() implementations that don't zero-fill partial pages.
5. read() the file into some malloc()'d memory and forget about mmap().
   - Pro: easy to malloc() an extra byte for NUL.
   - Con: loses what mmap() offers.
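For comparison, the last option (read the file into a buffer with one spare byte) might look roughly like this. It is a sketch in C++ rather than C, so a std::vector stands in for malloc(); the name read_with_nul is hypothetical, and error handling is deliberately minimal (for example, EINTR is not retried).

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: reads the whole file into a buffer that is one byte
// larger than the data read, so the last byte is always a NUL terminator.
// Returns an empty vector on failure; an empty file yields a single '\0'.
std::vector<char> read_with_nul(const char *path) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return {};
    struct stat sb;
    if (fstat(fd, &sb) != 0) { close(fd); return {}; }
    const std::size_t size = static_cast<std::size_t>(sb.st_size);
    std::vector<char> buf(size + 1, '\0');
    std::size_t done = 0;
    while (done < size) {
        ssize_t n = read(fd, buf.data() + done, size - done);
        if (n < 0) { close(fd); return {}; }  // read error (sketch: no EINTR retry)
        if (n == 0) break;                    // file turned out shorter than st_size
        done += static_cast<std::size_t>(n);
    }
    close(fd);
    buf.resize(done + 1);   // keep exactly the bytes read plus the terminator
    buf[done] = '\0';
    return buf;
}
```

The returned buffer's data() pointer can then be handed to the ordinary str*() functions.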
Solution #1 seems generally the best, and just requires a some extra work on the part of the functions reading the text.
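As a rough illustration of solution #1, the sketch below carries an explicit length and never looks for a terminator. It is written in C++, where std::search can stand in for the missing bounded strstr; find_bounded and first_line_length are hypothetical names, not existing library functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: a bounded stand-in for strstr(). Searches
// [text, text + len) for the NUL-terminated needle and never reads
// past text + len. Returns nullptr when there is no match.
const char *find_bounded(const char *text, std::size_t len, const char *needle) {
    assert(needle != nullptr);
    const char *end = text + len;
    const std::size_t nlen = std::strlen(needle);
    const char *hit = std::search(text, end, needle, needle + nlen);
    return hit == end ? nullptr : hit;
}

// Hypothetical helper: length of the first line, using memchr() with an
// explicit byte count instead of strchr().
std::size_t first_line_length(const char *text, std::size_t len) {
    const void *nl = std::memchr(text, '\n', len);
    if (nl == nullptr) return len;  // no newline: the whole buffer is one line
    return static_cast<std::size_t>(static_cast<const char *>(nl) - text);
}
```

Here text and len would be the mapped pointer and sb.st_size from the question, so the extra work is limited to passing the length along.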
Are there better alternatives, or are these the best solutions? Are there aspects of these solutions I haven't considered that makes them more or less attractive?
Null-terminated strings are sequences of characters whose last element is a null character (written '\0'). A string literal written in double quotes (“…”) is converted into a null-terminated string by the compiler.
C strings are terminated with a null character, which indicates the end of the string; such strings are called null-terminated strings. The null terminator of a multibyte string consists of one byte whose value is 0.
As of C++11, std::string is also guaranteed to be null terminated. Specifically, s[s.size()] will always be '\0'.
I would suggest undergoing a paradigm shift here.
You're looking at a universe in which text consists of '\0'-delimited strings. Instead of looking at the world this way, why not look at a world where text is a sequence defined by a beginning and an ending iterator?
You mmap your file, then set the beginning iterator, call it beg_iter, to the start of the mmap-ed segment, and the ending iterator, call it end_iter, to the first byte following the last byte in the mmap-ed segment, or beg_iter+number_of_pages*pagesize. Then repeat the following:
A) if end_iter equals beg_iter, stop;
B) if end_iter[-1] is not a null character, stop;
C) otherwise decrement end_iter, and go back to step A.
When you're done, you have a pair of iterators, the beginning iterator value, and the ending iterator value that define your text string.
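The trimming loop can be sketched in a few lines. This assumes step B is meant to test the byte just before end_iter (the one that would be dropped next); trim_trailing_nuls is a hypothetical name.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: shrinks [beg_iter, end_iter) until it is empty or
// its last byte is not a null character, then returns the resulting pair.
std::pair<const char *, const char *>
trim_trailing_nuls(const char *beg_iter, const char *end_iter) {
    assert(beg_iter <= end_iter);
    while (end_iter != beg_iter      // step A: stop when the range is empty
           && end_iter[-1] == '\0')  // step B: stop at the first non-null byte
        --end_iter;                  // step C: drop one padding byte
    return {beg_iter, end_iter};
}
```

Note that a file whose real contents end in null bytes would be trimmed as well, since the loop cannot tell padding from data.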
Of course, in this case, your iterators are plain char *, but that's really not very important. What is important is that you now have a rich set of algorithms and templates from the C++ standard library at your disposal, which let you implement many complicated operations, both mutating (like std::transform) and non-mutating (like std::find).
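As a small illustration, here is how a pair of char pointers plugs into those algorithms. The three helper names are made up for this sketch. The mutating example needs writable memory: a PROT_READ mapping like the one in the question cannot be modified, so it would apply to a mapping made with PROT_WRITE (with MAP_PRIVATE the file itself stays untouched) or to an ordinary buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>

// Non-mutating: number of newline characters in [beg, end).
std::size_t count_newlines(const char *beg, const char *end) {
    assert(beg <= end);
    return static_cast<std::size_t>(std::count(beg, end, '\n'));
}

// Non-mutating: pointer to the first newline, or end if there is none.
const char *end_of_first_line(const char *beg, const char *end) {
    return std::find(beg, end, '\n');
}

// Mutating: upper-cases [beg, end) in place; requires writable memory.
void uppercase_in_place(char *beg, char *end) {
    std::transform(beg, end, beg, [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
}
```

None of these calls needs a terminator; the end pointer plays that role.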
Null-terminated strings are really a holdover from the days of plain C. With C++, null-terminated strings are somewhat archaic and mundane. Modern C++ code should use std::string objects, and sequences defined by beginning and ending iterators.
One small footnote: instead of figuring out how much NUL padding you ended up mmap-ing, you might find it easier to fstat() the file and get its exact length, in bytes, before mmap-ing it. Then you'll know exactly how much got mmap-ed, and you don't have to reverse-engineer it by looking at the padding.
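Put together, the fstat()-first approach might look like the sketch below. MappedText, map_text, and unmap_text are hypothetical names, and the error handling is minimal: any failure, as well as an empty file (mmap() rejects a zero length), is reported as a pair of null pointers.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Hypothetical type: the text is exactly the bytes in [beg, end).
struct MappedText {
    const char *beg;
    const char *end;
};

// Hypothetical helper: maps the file read-only and derives the end
// pointer from st_size, so no padding ever has to be inspected.
MappedText map_text(const char *path) {
    assert(path != nullptr);
    MappedText m{nullptr, nullptr};
    int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat sb;
    if (fstat(fd, &sb) == 0 && sb.st_size > 0) {
        const std::size_t len = static_cast<std::size_t>(sb.st_size);
        void *p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.beg = static_cast<const char *>(p);
            m.end = m.beg + len;
        }
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
    return m;
}

// Releases a mapping obtained from map_text().
void unmap_text(MappedText m) {
    if (m.beg != nullptr)
        munmap(const_cast<char *>(m.beg), static_cast<std::size_t>(m.end - m.beg));
}
```

The resulting beg and end can be passed straight to std::find, std::search, and similar algorithms.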