Why does Clang generate different code for reference and non-null pointer arguments?

This is related to "Why can't GCC generate an optimal operator== for a struct of two int32s?". I was playing around with the code from that question at godbolt.org and noticed this odd behavior.

struct Point {
    int x, y;
};

bool nonzero_ptr(Point const* a) {
    return a->x || a->y;
}

bool nonzero_ref(Point const& a) {
    return a.x || a.y;
}

https://godbolt.org/z/e49h6d

For nonzero_ptr, clang -O3 (all versions) produces this or similar code:

    mov     al, 1
    cmp     dword ptr [rdi], 0
    je      .LBB0_1
    ret
.LBB0_1:
    cmp     dword ptr [rdi + 4], 0
    setne   al
    ret

This strictly implements the short-circuiting behavior of the C++ function, loading the y field only if the x field is zero.

For nonzero_ref, clang 3.6 and earlier generate the same code as they do for nonzero_ptr, but clang 3.7 through 11.0.1 produce

    mov     eax, dword ptr [rdi + 4]
    or      eax, dword ptr [rdi]
    setne   al
    ret

which loads y unconditionally. No version of clang is willing to do that when the parameter is a pointer. Why?
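For comparison, the branchless asm corresponds roughly to the following source-level form. This is only a sketch: the name nonzero_branchless is hypothetical and not part of the original code, and it assumes the argument refers to a complete, valid Point.

```cpp
struct Point {
    int x, y;
};

// Hypothetical helper, not from the original post: reads both fields
// unconditionally, mirroring the mov / or / setne sequence above.
bool nonzero_branchless(Point const& a) {
    // Bitwise OR does not short-circuit, so both x and y are always loaded.
    return (a.x | a.y) != 0;
}
```

For any complete Point this returns the same value as a.x || a.y; the only difference is that y is read even when x is nonzero.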

The only situation I can think of (on the x64 platform) where the behavior of the branching code would be observably different is when there's no memory mapped at [rdi+4], but I'm still unsure why clang would consider that case important for pointers and not references. My best guess is that there is some language-lawyery argument that references must be to "full objects" and pointers needn't be:

char* p = alloc_4k_page_surrounded_by_guard_pages();
int* pi = reinterpret_cast<int*>(p + 4096 - sizeof(int));
Point* ppt = reinterpret_cast<Point*>(pi);  // ok???
ppt->x = 42;  // ok???
Point& rpt = *ppt;  // UB???

But if the spec implies that, I'm not seeing how.

asked Feb 21 '21 02:02

benrg

2 Answers

This is a missed optimization; the branchless code is safe for both C++ source versions.

In "Why is gcc allowed to speculatively load from a struct?", GCC actually does speculatively load both struct members through a pointer, even though the C source only references one or the other. So at least the GCC developers have decided that this optimization is 100% safe in their interpretation of the C and C++ standards (I think that's intentional, not a bug). For that code, clang generates a 0 or 1 index to choose which int to load, so clang is just as reluctant there as in your case to invent a load. (C vs C++: same asm with or without -xc, with a version of the source ported to work as either: https://godbolt.org/z/6oPKKd)

The obvious difference in your asm is that the pointer version avoids accessing a->y if a->x != 0. That only matters for correctness¹ if a->y is in an unmapped page; you're right that this is the relevant corner case.

But ISO C++ doesn't allow partial objects. I'm pretty sure the page-boundary setup in your example is undefined behaviour. In a path of execution that reads a->x, the compiler can assume it's safe to also read a->y.

This would of course not be the case for int *p; and p[0] || p[1], because it's totally valid for p to point at an array that happens to be 1 element long, in the last 4 bytes of a page, as long as p[0] is nonzero so that p[1] is never evaluated.
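A minimal sketch of that array case, with a hypothetical function name: the caller passes a single int whose value is nonzero, so the short-circuit means p[1] is never evaluated and the call is well defined. A compiler therefore cannot load p[1] unconditionally here.

```cpp
// Hypothetical name, for illustration only.
bool first_two_nonzero(int const* p) {
    // p[1] is evaluated only when p[0] == 0, so a caller may legally
    // pass a pointer to a single nonzero int.
    return p[0] || p[1];
}
```

For example, with int one[1] = {7}; the call first_two_nonzero(one) never touches one[1], whereas a Point argument always has both members present.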

As @Nate suggested in comments, perhaps clang simply doesn't take advantage of that ISO C++ fact when optimizing; maybe it internally transforms the access to something more like an array by the time it's considering this "if-conversion" type of optimization (branchy to branchless). Or maybe LLVM just doesn't let itself invent loads through pointers.

It can always do it for reference args because references are guaranteed non-NULL. It would be "even more" UB for the caller to do nonzero_ref(*ppt), like in your partial-object example, because in C++ terms we're dereferencing a pointer to the whole object.

An experiment: deref the pointer to get a full tmp object

bool nonzero_ptr_full_deref(Point const* pa) {
    Point a = *pa;
    return a.x || a.y;
}

https://godbolt.org/z/ejrn9h - compiles branchlessly, same as nonzero_ref. Not sure what / how much this tells us. This is what I expected, given that it makes access to a->y effectively unconditional in the C++ source.

Footnote 1: Like all mainstream ISAs, x86-64 doesn't do hardware race detection, so the possibility of loading something another thread might be writing only matters for performance, and then only if the full struct is split across a cache-line boundary since we're already reading one member. If the object doesn't span a cache line, any false-sharing performance effect is already incurred.

Making asm like this doesn't "introduce data-race UB" because x86 asm has well-defined behaviour for this possibility, unlike ISO C++. The asm works for any possible value loaded from [rdi+4] so it correctly implements the semantics of the C++ source. Inventing reads is thread-safe, unlike writes, and is allowed because it's not volatile so the access isn't a visible side-effect. The only question is whether the pointer must point to a full valid Point object.

Part of the reason data races (on non-atomic objects) are Undefined Behaviour is to allow for C++ implementations on hardware with race detection. Another is to allow compilers to assume that it's safe to reload something they accessed once and expect the same value, unless there's an acquire or seq_cst load between the two points, even to the point of making code that would crash if the 2nd load differed from the first. That's irrelevant in this case because we're not talking about turning 1 access into 2 (instead, 0 into 1 whose value may not matter), but it is why roll-your-own atomics (e.g. in the Linux kernel) need to use volatile* casts for ACCESS_ONCE (https://lwn.net/Articles/793253/#Invented%20Loads).
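As a rough illustration of the volatile-cast idea, here is a sketch in C++. The helper read_once is a hypothetical name modeled on the concept behind ACCESS_ONCE, not the kernel's actual macro, and it is only meant for scalar types; it gives no atomicity or ordering guarantees in ISO C++ terms.

```cpp
// Hypothetical helper, for illustration; not the Linux kernel's macro.
template <class T>
T read_once(T const& obj) {
    // Reading through a volatile-qualified lvalue makes the load a visible
    // side effect, so the compiler may not invent, merge, or drop it.
    return *static_cast<T const volatile*>(&obj);
}
```

A plain (non-volatile) read of the same variable carries no such restriction, which is what lets a compiler invent or repeat loads in the first place.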

answered Oct 03 '22 20:10

Peter Cordes

I believe that from the point of view of standard C++, the compiler could emit the same code for both, since there is no provision in the standard for "partial objects" like the one you've constructed. The fact that it doesn't could simply be a missed optimization.

One could compare code like a->x || b->y where the compiler really does have to emit a branch, since the caller could legally pass a null or invalid pointer for b so long as a->x is nonzero. On the other hand, if a,b are references, then a.x || b.y should not need a branch according to the standard, since they must always be references to valid objects. So the "missed optimization" in your nonzero_ptr could just be the compiler not noticing that it can take advantage of the fact that the pointers in a->x and a->y are the same pointer.
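A small sketch of the two-pointer case (the function name either_nonzero is hypothetical): because of short-circuit evaluation, passing a null b is fine whenever a->x is nonzero, so the compiler cannot load b->y unconditionally.

```cpp
struct Point {
    int x, y;
};

// Hypothetical name, for illustration only.
bool either_nonzero(Point const* a, Point const* b) {
    // b is dereferenced only when a->x == 0, so b may be null otherwise.
    return a->x || b->y;
}
```

With Point p{1, 0}; the call either_nonzero(&p, nullptr) is well defined and returns true, which is why a branch (or equivalent) is required here.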

Alternatively, it's possible that clang is, as an extension, trying to produce code that will still work when you use non-standard features to create objects in which only some members can be accessed. The fact that this works for pointers but not for references could be a bug or limitation of that extension, but I don't think it's any sort of conformance violation.

answered Oct 03 '22 20:10

Nate Eldredge

Tags: c++, pass-by-reference, language-lawyer, x86-64, intel, clang