Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

GCC compilation very slow (large file)

Tags:

c

gcc

mex

I am trying to compile a large C file (specifically for MATLAB mexing). The C file is around 20 MB (available from the GCC bug tracker if you want to play around with it).

Here is the command I am running and the output to screen, below. This has been running for hours, and as you can see, optimization is already disabled (-O0). Why is this so slow? Is there a way I can make this faster?

(For reference: Ubuntu 12.04 (Precise Pangolin) 64 bit and GCC 4.7.3)

/usr/bin/gcc -c -DMX_COMPAT_32   -D_GNU_SOURCE -DMATLAB_MEX_FILE  -I"/usr/local/MATLAB/R2015a/extern/include" -I"/usr/local/MATLAB/R2015a/simulink/include" -ansi -fexceptions -fPIC -fno-omit-frame-pointer -pthread -O0 -DNDEBUG path/to/test4.c -o /tmp/mex_198714460457975_3922/test4.o -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.7.3-2ubuntu1~12.04' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --with-system-zlib --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04)
COLLECT_GCC_OPTIONS='-c' '-D' 'MX_COMPAT_32' '-D' '_GNU_SOURCE' '-D' 'MATLAB_MEX_FILE' '-I' '/usr/local/MATLAB/R2015a/extern/include' '-I' '/usr/local/MATLAB/R2015a/simulink/include' '-ansi' '-fexceptions' '-fPIC' '-fno-omit-frame-pointer' '-pthread' '-O0' '-D' 'NDEBUG' '-o' '/tmp/mex_198714460457975_3922/test4.o' '-v' '-mtune=generic' '-march=x86-64'
 /usr/lib/gcc/x86_64-linux-gnu/4.7/cc1 -quiet -v -I /usr/local/MATLAB/R2015a/extern/include -I /usr/local/MATLAB/R2015a/simulink/include -imultilib . -imultiarch x86_64-linux-gnu -D_REENTRANT -D MX_COMPAT_32 -D _GNU_SOURCE -D MATLAB_MEX_FILE -D NDEBUG path/to/test4.c -quiet -dumpbase test4.c -mtune=generic -march=x86-64 -auxbase-strip /tmp/mex_198714460457975_3922/test4.o -O0 -ansi -version -fexceptions -fPIC -fno-omit-frame-pointer -fstack-protector -o /tmp/ccxDOA5f.s
GNU C (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) version 4.7.3 (x86_64-linux-gnu)
    compiled by GNU C version 4.7.3, GMP version 5.0.2, MPFR version 3.1.0-p3, MPC version 0.9
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/MATLAB/R2015a/extern/include
 /usr/local/MATLAB/R2015a/simulink/include
 /usr/lib/gcc/x86_64-linux-gnu/4.7/include
 /usr/local/include
 /usr/lib/gcc/x86_64-linux-gnu/4.7/include-fixed
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
GNU C (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) version 4.7.3 (x86_64-linux-gnu)
    compiled by GNU C version 4.7.3, GMP version 5.0.2, MPFR version 3.1.0-p3, MPC version 0.9
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: c119948b394d79ea05b6b3986ab084cf

EDIT: a follow-on: I followed chqrlie's advice and tcc compiled my function in <5 seconds (I had to remove the -ansi flag only and turn "gcc" to "tcc"), which is pretty remarkable, really. I can only imagine the complexity of GCC.

When trying to then mex it, however, there is one other command mex typically needs. The second command is typically:

/usr/bin/gcc -pthread -Wl,--no-undefined -Wl,-rpath-link,/usr/local/MATLAB/R2015a/bin/glnxa64 -shared  -O -Wl,--version-script,"/usr/local/MATLAB/R2015a/extern/lib/glnxa64/mexFunction.map" /tmp/mex_61853296369424_4031/test4.o   -L"/usr/local/MATLAB/R2015a/bin/glnxa64" -lmx -lmex -lmat -lm -lstdc++ -o test4.mexa64

I cannot run this with tcc as some of these flags are not compatible. If I try to run this second compilation step with GCC, I get:

/usr/bin/ld: test4.o: relocation R_X86_64_PC32 against undefined symbol `mxGetPr' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status

EDIT: The solution appears to be clang. tcc can compile the file, but the arguments in the second step in mexing are incompatible with tcc's argument options. Clang is very fast and produces a nice, small, optimized file.

like image 631
user650261 Avatar asked Oct 30 '15 19:10

user650261


3 Answers

Nearly the entire file is one expression, the assignment to double f[24] = .... That's going to generate a gigantic abstract syntax tree tree. I'd be surprised if anything but a specialized compiler could handle that efficiently.

The 20 megabyte file itself may be fine, but the one giant expression may be what is causing the issue. Try as a preliminary step, splitting the line into double f[24] = {0} and then 24 assignments of f[0] = ...; f[1] = ... and see what happens. Worst case, you can split the 24 assignments into 24 functions, each in its own .c file, and compile them separately. This won't reduce the size of the AST, it will just reorganize it, but GCC is probably more optimized at handling many statements which together add up to a lot of code, vs. one huge expression.

The ultimate approach would be to generate the code in a more optimized manner. If I search for s4*s5*s6, for example, I get 77,783 hits. These s[4-6] variables don't change. You should generate a temporary variable, double _tmp1 = s4*s5*s6; and then use that instead of repeating the expression. You've just eliminated 311,132 nodes from your abstract syntax tree (assuming s4*s5*s6 is 5 nodes and _tmp1 is one node). That's that much less processing GCC has to do. This should also generate faster code (you won't repeat the same multiplication 77,783 times).

If you do this in a smart way in a recursive manner (e.g. s4*s5*s6 --> _tmp1, (c4*c6+s4*s5*s6) --> (c4*c6+_tmp1) --> _tmp2, c5*s6*(c4*c6+s4*s5*s6) --> c5*s6*_tmp2 -> _tmp3, etc...), you can probably eliminate most of the size of the generated code.

like image 89
Claudiu Avatar answered Oct 13 '22 18:10

Claudiu


Upon testing, I found that the Clang compiler seems to have less problems compiling large files. Although Clang consumed almost a gigabyte of memory during compilation, it successfully turned OP's source code form into a 70 kB object file. This works for all optimization levels I tested.

gcc was also able to compile this file quickly and without consuming too much memory if optimization is turned on. This bug in gcc comes from the large expression in OPs code which places a huge burden on the register allocator. With optimizations turned on, the compiler performs an optimization called common subexpression elimination which is able to remove a lot of redundancy from OPs code, reducing both compilation time and object file size to manageable values.

Here are some tests with the testcase from the aforementioned bug report:

$ time gcc5 -O3 -c -o testcase.gcc5-O3.o testcase.c
real    0m39,30s
user    0m37,85s
sys     0m1,42s
$ time gcc5 -O0 -c -o testcase.gcc5-O0.o testcase.c
real    23m33,34s
user    23m27,07s
sys     0m5,92s
$ time tcc -c -o testcase.tcc.o testcase.c
real    0m2,60s
user    0m2,42s
sys     0m0,17s
$ time clang -O3 -c -o testcase.clang-O3.o testcase.c
real    0m13,71s
user    0m12,55s
sys     0m1,16s
$ time clang -O0 -c -o testcase.clang-O0.o testcase.c
real    0m17,63s
user    0m16,14s
sys     0m1,49s
$ time clang -Os -c -o testcase.clang-Os.o testcase.c
real    0m14,88s
user    0m13,73s
sys 0m1,11s
$ time clang -Oz -c -o testcase.clang-Oz.o testcase.c
real    0m13,56s
user    0m12,45s
sys     0m1,09

This is the resulting object file size:

    text       data     bss      dec        hex filename
39101286          0       0 39101286    254a366 testcase.clang-O0.o
   72161          0       0    72161      119e1 testcase.clang-O3.o
   72087          0       0    72087      11997 testcase.clang-Os.o
   72087          0       0    72087      11997 testcase.clang-Oz.o
38683240          0       0 38683240    24e4268 testcase.gcc5-O0.o
   87500          0       0    87500      155cc testcase.gcc5-O3.o
   78239          0       0    78239      1319f testcase.gcc5-Os.o
69210504    3170616       0 72381120    45072c0 testcase.tcc.o
like image 38
fuz Avatar answered Oct 13 '22 18:10

fuz


Try Fabrice Bellard's tiny C compiler tcc from http://tinycc.org:

chqrlie$ time tcc -c test4.c

real    0m1.336s
user    0m1.248s
sys     0m0.084s

chqrlie$ size test4.o
   text    data     bss     dec     hex filename
38953877        3170632       0 42124509        282c4dd test4.o

Yes, that's 1.336 seconds on a pretty basic PC!

Of course I cannot test the resulting executable, but the object file should be linkable with the rest of your program and libraries.

For this test, I used a dummy version of file mex.h:

typedef struct mxArray mxArray;
double *mxGetPr(const mxArray*);
enum { mxREAL = 0 };
mxArray *mxCreateDoubleMatrix(int nx, int ny, int type);

gcc still has not finished compiling...

EDIT: gcc managed to hog my Linux box so badly I cannot connect to it anymore:(

like image 41
chqrlie Avatar answered Oct 13 '22 19:10

chqrlie