Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

gcc-10.0.1 Specific Segfault

Tags:

I have an R package with C compiled code that's been relatively stable for quite a while and is frequently tested against a broad variety of platforms and compilers (windows/osx/debian/fedora gcc/clang).

More recently a new platform was added to test the package again:

Logs from checks with gcc trunk aka 10.0.1 compiled from source on Fedora 30. (For some archived packages, 10.0.0.)  x86_64 Fedora 30 Linux  FFLAGS="-g -O2 -mtune=native -Wall -fallow-argument-mismatch" CFLAGS="-g -O2 -Wall -pedantic -mtune=native -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong -fstack-clash-protection -fcf-protection" CXXFLAGS="-g -O2 -Wall -pedantic -mtune=native -Wno-ignored-attributes -Wno-deprecated-declarations -Wno-parentheses -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong -fstack-clash-protection -fcf-protection" 

At which point the compiled code promptly started segfaulting along these lines:

 *** caught segfault *** address 0x1d00000001, cause 'memory not mapped' 

I've been able to reproduce the segfault consistently by using the rocker/r-base docker container with gcc-10.0.1 with optimization level -O2. Running a lower optimization gets rid of the problem. Running any other set-up, including under valgrind (both -O0 and -O2), UBSAN (gcc/clang), shows no problems at all. I'm also reasonably sure this ran under gcc-10.0.0, but don't have the data.

I ran the gcc-10.0.1 -O2 version with gdb and noticed something that seems odd to me:

gdb vs code

While stepping through the highlighted section it appears the initialization of the second elements of the arrays is skipped (R_alloc is a wrapper around malloc that self garbage collects when returning control to R; the segfault happens before return to R). Later, the program crashes when the un-initialized element (in the gcc.10.0.1 -O2 version) is accessed.

I fixed this by explicitly initializing the element in question everywhere in the code that eventually led to the usage of the element, but it really should have been initialized to an empty string, or at least that's what I would have assumed.

Am I missing something obvious or doing something stupid? Both are reasonably likely as C is my second language by far. It's just strange that this just cropped up now, and I can't figure out what the compiler is trying to do.


UPDATE: Instructions to reproduce this, although this will only reproduce so long as debian:testing docker container has gcc-10 at gcc-10.0.1. Also, don't just run these commands if you don't trust me.

Sorry this is not a minimal reproducible example.

docker pull rocker/r-base docker run --rm -ti --security-opt seccomp=unconfined \   rocker/r-base /bin/bash apt-get update apt-get install gcc-10 gdb gcc-10 --version  # confirm 10.0.1 # gcc-10 (Debian 10-20200222-1) 10.0.1 20200222 (experimental)  # [master revision 01af7e0a0c2:487fe13f218:e99b18cf7101f205bfdd9f0f29ed51caaec52779]  mkdir ~/.R touch ~/.R/Makevars echo "CC = gcc-10 CFLAGS = -g -O2 -Wall -pedantic -mtune=native -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong -fstack-clash-protection -fcf-protection " >> ~/.R/Makevars  R -d gdb --vanilla 

Then in the R console, after typing run to get gdb to run the program:

f.dl <- tempfile() f.uz <- tempfile()  github.url <- 'https://github.com/brodieG/vetr/archive/v0.2.8.zip'  download.file(github.url, f.dl) unzip(f.dl, exdir=f.uz) install.packages(   file.path(f.uz, 'vetr-0.2.8'), repos=NULL,   INSTALL_opts="--install-tests", type='source' ) # minimal set of commands to segfault library(vetr) alike(pairlist(a=1, b="character"), pairlist(a=1, b=letters)) alike(pairlist(1, "character"), pairlist(1, letters)) alike(NULL, 1:3)                  # not a wild card at top level alike(list(NULL), list(1:3))      # but yes when nested alike(list(NULL, NULL), list(list(list(1, 2, 3)), 1:25)) alike(list(NULL), list(1, 2)) alike(list(), list(1, 2)) alike(matrix(integer(), ncol=7), matrix(1:21, nrow=3)) alike(matrix(character(), nrow=3), matrix(1:21, nrow=3)) alike(   matrix(integer(), ncol=3, dimnames=list(NULL, c("R", "G", "B"))),   matrix(1:21, ncol=3, dimnames=list(NULL, c("R", "G", "B"))) )  # Adding tests from docs  mx.tpl <- matrix(   integer(), ncol=3, dimnames=list(row.id=NULL, c("R", "G", "B")) ) mx.cur <- matrix(   sample(0:255, 12), ncol=3, dimnames=list(row.id=1:4, rgb=c("R", "G", "B")) ) mx.cur2 <-   matrix(sample(0:255, 12), ncol=3, dimnames=list(1:4, c("R", "G", "B")))  alike(mx.tpl, mx.cur2) 

Inspecting in gdb pretty quickly shows (if I understand correctly) that CSR_strmlen_x is trying to access the string that was not initialized.

UPDATE 2: this is a highly recursive function, and on top of that the string initialization bit gets called many, many times. This is mostly b/c I was being lazy, we only need the strings initialized for the one time we actually encounter something we want to report in the recursion, but it was easier to initialize every time it is possible to encounter something. I mention this because what you'll see next shows multiple initializations, but only one of them (presumably the one with address <0x1400000001>) is being used.

I can't guarantee that the stuff I'm showing here is directly related to the element that caused the segfault (though it is the same illegal address acccess), but as @nate-eldredge asked it does show that the array element is not initialized either just before return or just after return in the calling function. Note the calling function is initializing 8 of these, and I show them all, with all them filled with either garbage or inaccessible memory.

enter image description here

UPDATE 3, disassembly of function in question:

Breakpoint 1, ALIKEC_res_strings_init () at alike.c:75 75    return res; (gdb) p res.current[0] $1 = 0x7ffff46a0aa5 "%s%s%s%s" (gdb) p res.current[1] $2 = 0x1400000001 <error: Cannot access memory at address 0x1400000001> (gdb) disas /m ALIKEC_res_strings_init Dump of assembler code for function ALIKEC_res_strings_init: 53  struct ALIKEC_res_strings ALIKEC_res_strings_init() {    0x00007ffff4687fc0 <+0>: endbr64   54    struct ALIKEC_res_strings res;  55   56    res.target = (const char **) R_alloc(5, sizeof(const char *));    0x00007ffff4687fc4 <+4>: push   %r12    0x00007ffff4687fc6 <+6>: mov    $0x8,%esi    0x00007ffff4687fcb <+11>:    mov    %rdi,%r12    0x00007ffff4687fce <+14>:    push   %rbx    0x00007ffff4687fcf <+15>:    mov    $0x5,%edi    0x00007ffff4687fd4 <+20>:    sub    $0x8,%rsp    0x00007ffff4687fd8 <+24>:    callq  0x7ffff4687180 <R_alloc@plt>    0x00007ffff4687fdd <+29>:    mov    $0x8,%esi    0x00007ffff4687fe2 <+34>:    mov    $0x5,%edi    0x00007ffff4687fe7 <+39>:    mov    %rax,%rbx  57    res.current = (const char **) R_alloc(5, sizeof(const char *));    0x00007ffff4687fea <+42>:    callq  0x7ffff4687180 <R_alloc@plt>  58   59    res.target[0] = "%s%s%s%s";    0x00007ffff4687fef <+47>:    lea    0x1764a(%rip),%rdx        # 0x7ffff469f640    0x00007ffff4687ff6 <+54>:    lea    0x18aa8(%rip),%rcx        # 0x7ffff46a0aa5    0x00007ffff4687ffd <+61>:    mov    %rcx,(%rbx)  60    res.target[1] = "";  61    res.target[2] = "";    0x00007ffff4688000 <+64>:    mov    %rdx,0x10(%rbx)  62    res.target[3] = "";    0x00007ffff4688004 <+68>:    mov    %rdx,0x18(%rbx)  63    res.target[4] = "";    0x00007ffff4688008 <+72>:    mov    %rdx,0x20(%rbx)  64   65    res.tar_pre = "be";  66   67    res.current[0] = "%s%s%s%s";    0x00007ffff468800c <+76>:    mov    %rax,0x8(%r12)    0x00007ffff4688011 <+81>:    mov    %rcx,(%rax)  68    res.current[1] = "";  69    res.current[2] = "";    0x00007ffff4688014 <+84>:    mov    %rdx,0x10(%rax)  70    res.current[3] = "";    0x00007ffff4688018 <+88>:    mov    %rdx,0x18(%rax)  71    res.current[4] = "";    0x00007ffff468801c <+92>:    mov    %rdx,0x20(%rax)  72   73    res.cur_pre = "is";  74   75    return res; => 0x00007ffff4688020 <+96>:    lea    0x14fe0(%rip),%rax        # 0x7ffff469d007    0x00007ffff4688027 <+103>:   mov    %rax,0x10(%r12)    0x00007ffff468802c <+108>:   lea    0x14fcd(%rip),%rax        # 0x7ffff469d000    0x00007ffff4688033 <+115>:   mov    %rbx,(%r12)    0x00007ffff4688037 <+119>:   mov    %rax,0x18(%r12)    0x00007ffff468803c <+124>:   add    $0x8,%rsp    0x00007ffff4688040 <+128>:   pop    %rbx    0x00007ffff4688041 <+129>:   mov    %r12,%rax    0x00007ffff4688044 <+132>:   pop    %r12    0x00007ffff4688046 <+134>:   retq       0x00007ffff4688047:  nopw   0x0(%rax,%rax,1)  End of assembler dump. 

UPDATE 4:

So, trying to parse through the standard here are the parts of it that seem relevant (C11 draft):

6.3.2.3 Par7 Conversions > Other Operands > Pointers

A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned 68) for the referenced type, the behavior is undefined.
Otherwise, when converted back again, the result shall compare equal to the original pointer. When a pointer to an object is converted to a pointer to a character type,the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.

6.5 Par6 Expressions

The effective type of an object for an access to its stored value is the declared type of the object, if any. 87) If a value is stored into an object having no declared type through an lvalue having a type that is not a character type, then the type of the lvalue becomes the effective type of the object for that access and for subsequent accesses that do not modify the stored value. If a value is copied into an object having no declared type using memcpy or memmove, or is copied as an array of character type, then the effective type of the modified object for that access and for subsequent accesses that do not modify the value is the effective type of the object from which the value is copied, if it has one. For all other accesses to an object having no declared type, the effective type of the object is simply the type of the lvalue used for the access.

87) Allocated objects have no declared type.

IIUC R_alloc returns an offset into a malloced block that is guaranteed to be double aligned, and the size of the block after the offset is of the requested size (there is also allocation before the offset for R specific data). R_alloc casts that pointer to (char *) on return.

Section 6.2.5 Par 29

A pointer to void shall have the same representation and alignment requirements as a pointer to a character type. 48) Similarly, pointers to qualified or unqualified versions of compatible types shall have the same representation and alignment requirements. All pointers to structure types shall have the same representation and alignment requirements as each other.
All pointers to union types shall have the same representation and alignment requirements as each other.
Pointers to other types need not have the same representation or alignment requirements.

48) The same representation and alignment requirements are meant to imply interchangeability asarguments to functions, return values from functions, and members of unions.

So the question is "are we allowed to recast the (char *) to (const char **) and write to it as (const char **)". My reading of the above is that so long as pointers on the systems the code run in have alignment compatible with double alignment, then its okay.

Are we violating "strict aliasing"? i.e.:

6.5 Par 7

An object shall have its stored value accessed only by an lvalue expression that has one of the following types: 88)

— a type compatible with the effective type of the object ...

88) The intent of this list is to specify those circumstances in which an object may or may not be aliased.

So, what should the compiler think the effective type of the object pointed to by res.target (or res.current) is? Presumably the declared type (const char **), or is this actually ambiguous? It feels to me that it isn't in this case only because there is no other 'lvalue' in scope that accesses the same object.

I'll admit I'm struggling mightily to extract sense from these sections of the standard.

like image 468
BrodieG Avatar asked Feb 26 '20 02:02

BrodieG


1 Answers

Summary: This appears to be a bug in gcc, related to string optimization. A self-contained testcase is below. There was initially some doubt as to whether the code is correct, but I think it is.

I have reported the bug as PR 93982. A proposed fix was committed but it does not fix it in all cases, leading to the followup PR 94015 (godbolt link).

You should be able to work around the bug by compiling with the flag -fno-optimize-strlen.


I was able to reduce your test case to the following minimal example (also on godbolt):

struct a {     const char ** target; };  char* R_alloc(void);  struct a foo(void) {     struct a res;     res.target = (const char **) R_alloc();     res.target[0] = "12345678";     res.target[1] = "";     res.target[2] = "";     res.target[3] = "";     res.target[4] = "";     return res; } 

With gcc trunk (gcc version 10.0.1 20200225 (experimental)) and -O2 (all other options turned out to be unnecessary), the generated assembly on amd64 is as follows:

.LC0:         .string "12345678" .LC1:         .string "" foo:         subq    $8, %rsp         call    R_alloc         movq    $.LC0, (%rax)         movq    $.LC1, 16(%rax)         movq    $.LC1, 24(%rax)         movq    $.LC1, 32(%rax)         addq    $8, %rsp         ret 

So you are quite right that the compiler is failing to initialize res.target[1] (note the conspicuous absence of movq $.LC1, 8(%rax)).

It is interesting to play with the code and see what affects the "bug". Perhaps significantly, changing the return type of R_alloc to void * makes it go away, and gives you "correct" assembly output. Maybe less significantly but more amusingly, changing the string "12345678" to be either longer or shorter also makes it go away.


Previous discussion, now resolved - the code is apparently legal.

The question I have is whether your code is actually legal. The fact that you take the char * returned by R_alloc() and cast it to const char **, and then store a const char * seems like it might violate the strict aliasing rule, as char and const char * are not compatible types. There is an exception that allows you to access any object as char (to implement things like memcpy), but this is the other way around, and as best I understand it, that's not allowed. It makes your code produce undefined behavior and so the compiler can legally do whatever the heck it wants.

If this is so, the correct fix would be for R to change their code so that R_alloc() returns void * instead of char *. Then there would be no aliasing problem. Unfortunately, that code is outside your control, and it's not clear to me how you can use this function at all without violating strict aliasing. A workaround might be to interpose a temporary variable, e.g. void *tmp = R_alloc(); res.target = tmp; which solves the problem in the test case, but I'm still not sure if it's legal.

However, I am not sure of this "strict aliasing" hypothesis, because compiling with -fno-strict-aliasing, which AFAIK is supposed to make gcc allow such constructs, does not make the problem go away!


Update. Trying some different options, I found that either -fno-optimize-strlen or -fno-tree-forwprop will result in "correct" code being generated. Also, using -O1 -foptimize-strlen yields the incorrect code (but -O1 -ftree-forwprop does not).

After a little git bisect exercise, the error seems to have been introduced in commit 34fcf41e30ff56155e996f5e04.


Update 2. I tried digging into the gcc source a little bit, just to see what I could learn. (I don't claim to be any sort of compiler expert!)

It looks like the code in tree-ssa-strlen.c is meant to keep track of strings appearing in the program. As near as I can tell, the bug is that in looking at the statement res.target[0] = "12345678"; the compiler conflates the address of the string literal "12345678" with the string itself. (That seems to be related to this suspicious code which was added in the aforementioned commit, where if it tries to count the bytes of a "string" that is actually an address, it instead looks at what that address points to.)

So it thinks that the statement res.target[0] = "12345678", instead of storing the address of "12345678" at the address res.target, is storing the string itself at that address, as if the statement were strcpy(res.target, "12345678"). Note for what's ahead that this would result in the trailing nul being stored at address res.target+8 (at this stage in the compiler, all offsets are in bytes).

Now when the compiler looks at res.target[1] = "", it likewise treats this as if it were strcpy(res.target+8, ""), the 8 coming from the size of a char *. That is, as if it were simply storing a nul byte at address res.target+8. However, the compiler "knows" that the previous statement already stored a nul byte at that very address! As such, this statement is "redundant" and can be discarded (here).

This explains why the string has to be exactly 8 characters long to trigger the bug. (Though other multiples of 8 can also trigger the bug in other situations.)

like image 110
Nate Eldredge Avatar answered Oct 02 '22 06:10

Nate Eldredge