Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

clang compiler produces different object files from same sources

I have a simple hello world objective-c lib:

hello.m:

#import <Foundation/Foundation.h>
#import "hello.h"

void sayHello()
{
    #ifdef FRENCH
    NSString *helloWorld = @"Hello World!\n";
    #else
    NSString *helloWorld = @"Bonjour Monde!\n";
    #endif
    NSFileHandle *stdout = [NSFileHandle fileHandleWithStandardOutput];
    NSData *strData = [helloWorld dataUsingEncoding: NSASCIIStringEncoding];
    [stdout writeData: strData];
}

the hello.h file looks like this:

int main (int argc, const char * argv[]);
int sum(int a, int b);
void sayHello();

This compiles just fine on osx and linux using clang and gcc.

Now my question:

When running a clean compile against hello.m multiple times with clang on ubuntu the generated hello.o can differ. This seems not related to a timestamp, because even after a second or more, the generated .o file can have the same checksum. From my naive point of view, this seems like a complete random/unpredicatable behaviour.

I ran the compilation with the -Sto inspect the generated assembler code. The assembler code also differs (as expected). The diff file of comparing the assembler code can be found here: http://pastebin.com/uY1LERGX

From a first look it just looks like the sorting is different in the assembler code.

This does not happen when compiling it with gcc.

Is there a way to tell clang to generate exactly the same .o file like gcc does?

clang --version: 
Ubuntu clang version 3.0-6ubuntu3 (tags/RELEASE_30/final) (based on LLVM 3.0)
like image 900
Rene Groeschke Avatar asked Mar 20 '23 14:03

Rene Groeschke


1 Answers

The feature when compiler always produce the same code is called Reproducible Builds or deterministic compilation.

One of possible sources of compiler's output instability is ASLR (Address space layout randomization). Sometimes compiler, or some libraries used by it, may read object address and use them, for example as keys of hashes or maps; or when sorting objects according to their addresses. When compiler is iterating over the hash, it will read objects in the order that depends on addresses of objects, and ASLR will place objects in different orders. The effect of such may looks like your reordered symbols (.quads in your diffs)

You can disable Linux ASLR globally with echo 0 | sudo tee /proc/sys/kernel/randomize_va_space. Local way of disabling ASLR in Linux is

 setarch `uname -m` -R /bin/bash`

man page of setarch says: -R, "--addr-no-randomize" Disables randomization of the virtual address space (turns on ADDR_NO_RANDOMIZE).

For OS X 10.6 there is DYLD_NO_PIE environment variable (check man dyld, possible usage in bash export DYLD_NO_PIE=1); in 10.7 and newer there is --no_pie build flag to be used in building the LLVM itself or by setting _POSIX_SPAWN_DISABLE_ASLR which should be used in posix_spawnattr_setflags before starting the llvm; or by using in 10.7+ the script http://src.chromium.org/viewvc/chrome/trunk/src/build/mac/change_mach_o_flags.py with --no-pie option to clear PIE flag from llvm binaries (thanks to asan people).


There were some errors in clang and llvm which prevents/prevented them to be completely deterministic, for example:

  • [cfe-dev] clang: not deterministic anymore? - Nov 3 2009, indeterminism was detected on code from LLVM bug 5355. Author says that indeterminism was present only with -g option enabled
  • [LLVMdev] Deterministic code generation and llvm::Iterators (2010)
  • [llvm-commits] Fix some TableGen non-deterministic behavior. (Sep 2012)
  • r196520 - Fix non-deterministic behavior. - SLPVectorizer was fixed into deterministic only at Dec 5, 2013 (replaced SmallSet with VectorSet)
  • 190793 - TableGen: give asm match classes deterministic order. "TableGen was sorting the entries in some of its internal data structures by pointer." - Sep 16, 2013
  • LLVM bug 14901 is the case when order of compiler warnings was Non-deterministic (Jan 2013).

The patch from 14901 contains comments about non-deterministic iterating over llvm::DenseMap:

-  typedef llvm::DenseMap<const VarDecl *, std::pair<UsesVec*, bool> > UsesMap;
+  typedef std::pair<UsesVec*, bool> MappedType;
+  // Prefer using MapVector to DenseMap, so that iteration order will be
+  // the same as insertion order. This is needed to obtain a deterministic
+  // order of diagnostics when calling flushDiagnostics().
+  typedef llvm::MapVector<const VarDecl *, MappedType> UsesMap;
...
-    // FIXME: This iteration order, and thus the resulting diagnostic order,
-    //        is nondeterministic.

Documentation of LLVM says that there are non-deterministic and deterministic variants of several internal containers, like Map vs MapVector: trunk/docs/ProgrammersManual.rst:

1164    The difference between SetVector and other sets is that the order of iteration
1165    is guaranteed to match the order of insertion into the SetVector.  This property
1166    is really important for things like sets of pointers.  Because pointer values
1167    are non-deterministic (e.g. vary across runs of the program on different
1168    machines), iterating over the pointers in the set will not be in a well-defined
1169    order.
1170    
1171    The drawback of SetVector is that it requires twice as much space as a normal
1172    set and has the sum of constant factors from the set-like container and the
1173    sequential container that it uses.  Use it **only** if you need to iterate over
1174    the elements in a deterministic order. 

...

1277    StringMap iteratation order, however, is not guaranteed to be deterministic, so
1278    any uses which require that should instead use a std::map.
...

1364    ``MapVector<KeyT,ValueT>`` provides a subset of the DenseMap interface.  The
1365    main difference is that the iteration order is guaranteed to be the insertion
1366    order, making it an easy (but somewhat expensive) solution for non-deterministic
1367    iteration over maps of pointers.

It is possible that some authors of LLVM thought that in their code there was no need to save determinism in iteration order. For example, there are comments in ARMTargetStreamer about usage of MapVector for ConstantPools (ARMTargetStreamer.cpp - class AssemblerConstantPools). But how can we sure that all usages of non-deterministic containers like DenseMap will not affect output of compiler? There are tens loops iterating over DenseMap: "DenseMap.*const_iterator" regex in codesearch.debian.net

Your version of LLVM and clang (3.0, from 2011-11-30) is clearly too old to have all determinism enhances from 2012 and 2013 years (some are listed in my answer). You should update your LLVM and Clang, then recheck your program for deterministic compilation, then locate non-determinism in shorter and easier to reproduce examples (e.g. save bc - bitcode - from middle stages), then you can post a bug in LLVM bugzilla.

like image 111
osgx Avatar answered Mar 23 '23 04:03

osgx