
`u8string_view` into a `char` array without violating strict-aliasing?

Premise

  • I have a blob of binary data in memory, represented as a char* (maybe read from a file, or transmitted over the network).
  • I know that it contains a UTF8-encoded text field of a certain length at a certain offset.

Question

How can I (safely and portably) get a u8string_view to represent the contents of this text field?

Motivation

The motivation for passing the field to down-stream code as a u8string_view is:

  • It very clearly communicates that the text field is UTF8-encoded, unlike string_view.
  • It avoids the cost (likely free-store allocation + copying) of returning it as u8string.

What I tried

The naive way to do this would be:

char* data = ...;
size_t field_offset = ...;
size_t field_length = ...;

char8_t* field_ptr = reinterpret_cast<char8_t*>(data + field_offset);
u8string_view field(field_ptr, field_length);

However, if I understand the C++ strict-aliasing rules correctly, this is undefined behavior because it accesses the contents of the char* buffer via the char8_t* pointer returned by reinterpret_cast, and char8_t is not an aliasing type.

Is that true?

Is there a way to do this safely?

asked Aug 11 '20 18:08 by smls


2 Answers

The strict-aliasing rule is violated when you access an object through a glvalue whose type is not one of the acceptable types for that object.

First consider a well-defined case:

// The storage really holds char8_t objects; data only views it as char.
char* data = reinterpret_cast<char*>(new char8_t[10]{});
size_t field_offset = 0;
size_t field_length = 10;
char8_t* field_ptr = reinterpret_cast<char8_t*>(data + field_offset);
u8string_view field(field_ptr, field_length);
field[0] + field[1];  // reads char8_t objects through char8_t glvalues

There is no UB here. You create an array of char8_t and then access the elements of that array.

Now what happens if the object in the memory referenced by data was created by another program? According to the standard this is UB, because the object was not created in one of the ways the standard specifies for creating objects.

But the fact that the standard does not yet support your code is not a problem in practice. All compilers support it. If they did not, nothing would work: you could not make even the simplest system call, because most communication between a program and a kernel goes through arrays of char. So as long as, inside your program, you access the memory between data+field_offset and data+field_offset+field_length through glvalues of type char8_t, your code will work as expected.

answered Oct 29 '22 03:10 by Oliv


This same problem occasionally occurs in other contexts too, for example when using shared memory.

A trick to create objects using bits in "raw" memory without allocating memory is to create a local object by memcpy, and then create a dynamic copy of that local object over the "raw" memory. Example:

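char* begin_raw = data + field_offset;
char8_t* last {};
for (std::size_t i = 0; i < field_length; i++) {  // unsigned index matches field_length
    char* current = begin_raw + i;
    char8_t local {};
    std::memcpy(&local, current, sizeof local);  // read the byte as a char8_t value
    last = new (current) char8_t(local);         // create a char8_t object in place
}
// Assumes field_length > 0; with an empty field, last would still be null here.
char8_t* begin = last - (field_length - 1);
std::u8string_view field(begin, field_length);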

Before you object that you don't want to copy, notice that the end result causes no changes to the representation of the "raw" memory. The compiler can notice this too, and can compile the entire loop into zero instructions (in my tests GCC and Clang achieve this with -O2). All that we have done is satisfy the object lifetime rules of the language by creating dynamic objects in that memory.

answered Oct 29 '22 02:10 by eerorika