This is related to Why can't GCC generate an optimal operator== for a struct of two int32s?. I was playing around with the code from that question at godbolt.org and noticed this odd behavior.
struct Point {
    int x, y;
};
bool nonzero_ptr(Point const* a) {
    return a->x || a->y;
}
bool nonzero_ref(Point const& a) {
    return a.x || a.y;
}
https://godbolt.org/z/e49h6d
For nonzero_ptr, clang -O3 (all versions) produces this or similar code:
mov al, 1
cmp dword ptr [rdi], 0
je .LBB0_1
ret
.LBB0_1:
cmp dword ptr [rdi + 4], 0
setne al
ret
This strictly implements the short-circuiting behavior of the C++ function, loading the y field only if the x field is zero.
For nonzero_ref, clang 3.6 and earlier generate the same code as they do for nonzero_ptr, but clang 3.7 through 11.0.1 produce
mov eax, dword ptr [rdi + 4]
or eax, dword ptr [rdi]
setne al
ret
which loads y unconditionally. No version of clang is willing to do that when the parameter is a pointer. Why?
The only situation I can think of (on the x64 platform) where the behavior of the branching code would be observably different is when there's no memory mapped at [rdi+4], but I'm still unsure why clang would consider that case important for pointers and not references. My best guess is that there is some language-lawyery argument that references must be to "full objects" and pointers needn't be:
char* p = alloc_4k_page_surrounded_by_guard_pages();
int* pi = reinterpret_cast<int*>(p + 4096 - sizeof(int));
Point* ppt = reinterpret_cast<Point*>(pi); // ok???
ppt->x = 42; // ok???
Point& rpt = *ppt; // UB???
But if the spec implies that, I'm not seeing how.
This is a missed optimization; the branchless code is safe for both C++ source versions.
In Why is gcc allowed to speculatively load from a struct?, GCC actually does speculatively load both struct members through a pointer even though the C source only references one or the other. So at least the GCC developers have decided that this optimization is 100% safe in their interpretation of the C and C++ standards (I think that's intentional, not a bug). Clang generates a 0 or 1 index to choose which int to load, so clang is still just as reluctant as in your case to invent a load. (C vs C++: same asm with or without -xc, with a version of the source ported to work as either: https://godbolt.org/z/6oPKKd)
The obvious difference in your asm is that the pointer version avoids access to a->y if a->x != 0, and that this only matters for correctness (see footnote 1) if a->y was in an unmapped page; you're right about that being the relevant corner case.
But ISO C++ doesn't allow partial objects. The page-boundary setup in your example is, I'm pretty sure, undefined behaviour. In a path of execution that reads a->x, the compiler can assume it's safe to also read a->y.
This would of course not be the case for int *p; and p[0] || p[1], because it's totally valid to have an implicit-length 0-terminated array that happens to be 1 element long, in the last 4 bytes of a page.
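To make the array case concrete, here is a minimal sketch. The names any_of_first_two and demo_single_int are made up for illustration and do not come from the question. The call is well-defined only because p[0] is nonzero, so the short-circuit means p[1] is never evaluated in the abstract machine; a compiler that invented an unconditional load of p[1] would be reading past the end of a valid object.

```cpp
// Hypothetical helper: same shape as nonzero_ptr, but over a plain int array.
bool any_of_first_two(int const* p) {
    // p[1] is only evaluated when p[0] == 0 (short-circuit ||).
    return p[0] || p[1];
}

// Only one int exists here. Because it is nonzero, p[1] is never read,
// so this call has well-defined behaviour.
bool demo_single_int() {
    int lone = 7;
    return any_of_first_two(&lone);
}
```

With a struct pointer the situation differs: a valid Point always has both members, which is why reading y after x could be assumed safe.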
As @Nate suggested in comments, perhaps clang simply doesn't take advantage of that ISO C++ fact when optimizing; maybe it does internally transform to something more like an array by the time it's considering this "if-conversion" type of optimization (branchy to branchless). Or maybe LLVM just doesn't let itself invent loads through pointers.
It can always do it for reference args because references are guaranteed non-NULL. It would be "even more" UB for the caller to do nonzero_ref(*ppt), like in your partial-object example, because in C++ terms we're dereferencing a pointer to the whole object.
bool nonzero_ptr_full_deref(Point const* pa) {
    Point a = *pa;
    return a.x || a.y;
}
https://godbolt.org/z/ejrn9h - compiles branchlessly, same as nonzero_ref. Not sure what / how much this tells us. This is what I expected, given that it makes access to a->y effectively unconditional in the C++ source.
Footnote 1: Like all mainstream ISAs, x86-64 doesn't do hardware race detection, so the possibility of loading something another thread might be writing only matters for performance, and then only if the full struct is split across a cache-line boundary since we're already reading one member. If the object doesn't span a cache line, any false-sharing performance effect is already incurred.
Making asm like this doesn't "introduce data-race UB" because x86 asm has well-defined behaviour for this possibility, unlike ISO C++. The asm works for any possible value loaded from [rdi+4] so it correctly implements the semantics of the C++ source. Inventing reads is thread-safe, unlike writes, and is allowed because it's not volatile so the access isn't a visible side-effect. The only question is whether the pointer must point to a full valid Point object.
Part of the reason data races (on non-atomic objects) are Undefined Behaviour is to allow for C++ implementations on hardware with race detection. Another is to allow compilers to assume that it's safe to reload something they accessed once, and expect the same value unless there's an acquire or seq_cst load between the two points, even to the point of making code that would crash if the 2nd load differed from the first. That's irrelevant in this case because we're not talking about turning 1 access into 2 (instead 0 into 1 whose value may not matter), but it is why roll-your-own atomics (e.g. in the Linux kernel) need to use volatile* casts for ACCESS_ONCE (https://lwn.net/Articles/793253/#Invented%20Loads).
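As a rough illustration of the volatile-cast idea, a C++ analogue might look like the sketch below. read_once is a hypothetical helper name, not the kernel's macro and not a standard-library API, and the sketch is restricted to scalar types as a simplifying assumption. It only shows how a volatile access stops the compiler from inventing or merging loads; it does not by itself make a racy access well-defined in ISO C++.

```cpp
#include <type_traits>

// Hypothetical helper: perform exactly one load of obj through a
// volatile-qualified lvalue, so the compiler may not duplicate or drop it.
template <class T>
T read_once(T const& obj) {
    static_assert(std::is_scalar<T>::value,
                  "this sketch only handles scalar types");
    return *static_cast<T const volatile*>(&obj);
}
```

For portable C++ code, std::atomic with an explicit memory order is the usual choice; the volatile cast is what kernel-style code uses instead.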
I believe that from the point of view of standard C++, the compiler could emit the same code for both, since there is no provision in the standard for "partial objects" like the one you've constructed. The fact that it doesn't could simply be a missed optimization.
One could compare code like a->x || b->y where the compiler really does have to emit a branch, since the caller could legally pass a null or invalid pointer for b so long as a->x is nonzero. On the other hand, if a, b are references, then a.x || b.y should not need a branch according to the standard, since they must always be references to valid objects. So the "missed optimization" in your nonzero_ptr could just be the compiler not noticing that it can take advantage of the fact that the pointers in a->x and a->y are the same pointer.
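A small sketch of the two-pointer case follows. The names either_ptr and demo_null_b are invented for illustration. Passing a null b is legal here as long as a->x is nonzero, because b->y is then never evaluated, which is why the compiler must keep the branch.

```cpp
struct Point {
    int x, y;
};

// Hypothetical two-pointer variant: b is only dereferenced when a->x == 0.
bool either_ptr(Point const* a, Point const* b) {
    return a->x || b->y;
}

// Well-defined: a->x is nonzero, so the null b is never dereferenced.
bool demo_null_b() {
    Point p{1, 0};
    return either_ptr(&p, nullptr);
}
```

An unconditional load of b->y would fault on this call, so branchless code would be wrong for the two-pointer version even though it would be fine when both accesses go through the same valid pointer.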
Alternatively, it's possible that clang is, as an extension, trying to produce code that will still work when you use non-standard features to create objects in which only some members can be accessed. The fact that this works for pointers but not for references could be a bug or limitation of that extension, but I don't think it's any sort of conformance violation.