
How to work around lack of NUL terminator in strings returned from mmap()?

When mmap()ing a text file, like so

struct stat sb;
int fd = open("file.txt", O_RDWR);
fstat(fd, &sb);  /* sb.st_size is the file length in bytes */
char *text = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

the file contents are mapped into memory directly, and text will not contain a NUL terminator, so operating on it with normal string functions would not be safe. On Linux (at least) the remaining bytes of the last, partially used page are zero-filled, so effectively you get a NUL terminator in all cases where the file size isn't a multiple of the page size.

But relying on that feels dirty, and other mmap() implementations (e.g., in FreeBSD, I think) don't zero-fill partial pages. Mappings of files whose size is an exact multiple of the page size will also lack the NUL terminator.

Are there reasonable ways to work around this or to add the NUL terminator?

Things I've considered

  1. Using strn*() functions exclusively and tracking distance to the end of the buffer.
    • Pros: No need for NUL terminator
    • Cons: Extra tracking required to know distance to end of file when parsing text; some str*() functions don't have a strn*() counterpart, like strstr.
  2. As another answer suggested, make an anonymous mapping at a fixed address following the mapping of your text file.
    • Pros: Can use regular C str*() functions
    • Cons: Using MAP_FIXED is not thread-safe; Seems like an awful hack anyway
  3. mmap() an extra byte and make the map writeable, and write the NUL terminator. The OpenGroup's mmap man page says you can make a mapping larger than your object's size but that accessing data outside of the actual mapped object will generate a SIGBUS.
    • Pros: Can use regular C str*() functions
    • Cons: Requires handling (ignoring?) SIGBUS, which could potentially mean something else happened. I'm not actually sure writing the NUL terminator will work?
  4. Expand files with sizes that are multiples of page size with ftruncate() by one byte.
    • Pros: Can use regular C str*() functions; ftruncate() will write a NUL byte to the newly allocated area for you
    • Cons: Means we have to write to the files, which may not be possible or acceptable in all cases; Doesn't solve problem for mmap() implementations that don't zero-fill partial pages
  5. Just read() the file into some malloc()'d memory and forget about mmap()
    • Pros: Avoids all of these solutions; Easy to malloc() an extra byte for NUL
    • Cons: Different performance characteristics than mmap()
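
For reference, option 5 could look roughly like the sketch below. The helper name read_file_nul is made up for illustration, it assumes a regular file whose size fstat() reports correctly, and error handling is kept minimal (for example, an interrupted read() is simply treated as a failure).

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: read all of `path` into a malloc()'d buffer with one
 * extra byte for the NUL terminator. Returns NULL on error; the caller
 * frees the result. On success *len_out is the number of bytes read,
 * excluding the terminator. */
static char *read_file_nul(const char *path, size_t *len_out)
{
    struct stat sb;
    char *buf = NULL;
    size_t got = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &sb) < 0 || sb.st_size < 0)
        goto fail;
    buf = malloc((size_t)sb.st_size + 1);   /* +1 for the terminator */
    if (buf == NULL)
        goto fail;
    while (got < (size_t)sb.st_size) {
        ssize_t n = read(fd, buf + got, (size_t)sb.st_size - got);
        if (n < 0)
            goto fail;
        if (n == 0)
            break;               /* file shrank after fstat(); keep what we have */
        got += (size_t)n;
    }
    buf[got] = '\0';
    close(fd);
    *len_out = got;
    return buf;

fail:
    free(buf);                   /* free(NULL) is a no-op */
    close(fd);
    return NULL;
}
```

The returned buffer can be passed to the regular str*() functions, with the caveat that a NUL byte inside the file itself would still cut the string short.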

Solution #1 seems generally the best, and just requires some extra work on the part of the functions reading the text.
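
To illustrate the kind of extra work solution #1 involves, here is a minimal sketch of a length-bounded replacement for strstr() built from memchr() and memcmp(). The name buf_find is hypothetical, not a standard function; some platforms provide memmem() for the same job, but it is not available everywhere.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a standard function): find the first occurrence
 * of the NUL-terminated string `needle` inside the buffer [buf, buf + len),
 * which itself need not be NUL-terminated. Returns NULL if not found. */
static const char *buf_find(const char *buf, size_t len, const char *needle)
{
    size_t nlen = strlen(needle);

    if (nlen == 0)
        return buf;              /* empty needle matches at the start, like strstr() */

    while (len >= nlen) {
        /* Only search positions where the full needle still fits. */
        const char *p = memchr(buf, needle[0], len - nlen + 1);
        if (p == NULL)
            return NULL;
        if (memcmp(p, needle, nlen) == 0)
            return p;
        /* Skip past this false start and shrink the remaining length. */
        len -= (size_t)(p - buf) + 1;
        buf = p + 1;
    }
    return NULL;
}
```

With the mapping from the question this would be called as buf_find(text, sb.st_size, "pattern"), so the search never reads past the end of the file.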

Are there better alternatives, or are these the best solutions? Are there aspects of these solutions I haven't considered that make them more or less attractive?

mattst88 asked Nov 24 '14 01:11



1 Answer

I would suggest undergoing a paradigm shift here.

You're looking at a universe that consists of '\0'-delimited strings that define your text. Instead of looking at the world this way, why don't you try looking at a world where text is a sequence defined by a beginning and an ending iterator?

You mmap your file, then set the beginning iterator, call it beg_iter, to the start of the mmap-ed segment, and the ending iterator, call it end_iter, to the first byte following the last byte in the mmap-ed segment, or beg_iter+number_of_pages*pagesize. Then:

A) if end_iter equals beg_iter, stop;

B) if end_iter[-1] is not a null character, stop;

C) otherwise decrement end_iter, and go back to step A.

When you're done, you have a pair of iterators, the beginning iterator value, and the ending iterator value that define your text string.
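
Since the question is in C, the loop above can be sketched with plain char pointers as follows. The function name trim_trailing_nuls is hypothetical, and the sketch assumes every byte in [beg, end) is readable, which for a mapping means end must not point past the last mapped page.

```c
#include <stddef.h>

/* Hypothetical helper: given a buffer [beg, end), return the new end after
 * dropping trailing NUL bytes, so that [beg, result) is the text itself. */
static const char *trim_trailing_nuls(const char *beg, const char *end)
{
    /* Steps A and B are the loop condition, step C is the loop body. */
    while (end != beg && end[-1] == '\0')
        --end;
    return end;
}
```

Note that this cannot tell padding apart from NUL bytes that are genuinely the last bytes of the file.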

Of course, in this case, your iterators are plain char *, but that's really not very important. What is important is that you now find yourself with a rich set of algorithms and templates from the C++ standard library at your disposal, which let you implement many complicated operations, both mutating (like std::transform) and non-mutating (like std::find).

Null-terminated strings are really a holdover from the days of plain C. With C++, null-terminated strings are somewhat archaic, and mundane. Modern C++ code should use std::string objects, and sequences defined by beginning and ending iterators.

One small footnote: instead of figuring out how much NUL padding you ended up mmap-ing, you might find it easier to fstat() the file and get the file's exact length, in bytes, before mmap-ing it. Then you'll know exactly how much got mmap-ed, and you don't have to reverse-engineer it by looking at the padding.
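
A rough C sketch of that footnote: take the length from fstat() and keep the mapping as a begin/end pair. The struct text_view type and the map_text/unmap_text names are invented for this example, and the empty-file handling is one possible choice among several.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical type: a read-only view of a mapped file as [beg, end). */
struct text_view {
    const char *beg;
    const char *end;
};

/* Map `path` read-only and fill in *view. Returns 0 on success, -1 on error.
 * An empty file yields beg == end == NULL without calling mmap(), because
 * mmap() rejects a zero length. */
static int map_text(const char *path, struct text_view *view)
{
    struct stat sb;
    void *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &sb) < 0) {
        close(fd);
        return -1;
    }
    if (sb.st_size == 0) {
        close(fd);
        view->beg = view->end = NULL;
        return 0;
    }
    p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                   /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;
    view->beg = p;
    view->end = view->beg + sb.st_size;   /* exact end of the file contents */
    return 0;
}

/* Release a view obtained from map_text(). */
static void unmap_text(struct text_view *view)
{
    if (view->beg != NULL)
        munmap((void *)view->beg, (size_t)(view->end - view->beg));
}
```

Any parsing code then works on the range [view.beg, view.end) and never needs to look at the padding at all.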

Sam Varshavchik answered Sep 21 '22 04:09
