I am trying to compile a large C file (specifically for MATLAB mexing). The C file is around 20 MB (available from the GCC bug tracker if you want to play around with it).
Here is the command I am running and the output to screen, below. This has been running for hours, and as you can see, optimization is already disabled (-O0). Why is this so slow? Is there a way I can make this faster?
(For reference: Ubuntu 12.04 (Precise Pangolin) 64 bit and GCC 4.7.3)
/usr/bin/gcc -c -DMX_COMPAT_32 -D_GNU_SOURCE -DMATLAB_MEX_FILE -I"/usr/local/MATLAB/R2015a/extern/include" -I"/usr/local/MATLAB/R2015a/simulink/include" -ansi -fexceptions -fPIC -fno-omit-frame-pointer -pthread -O0 -DNDEBUG path/to/test4.c -o /tmp/mex_198714460457975_3922/test4.o -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.7.3-2ubuntu1~12.04' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --with-system-zlib --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04)
COLLECT_GCC_OPTIONS='-c' '-D' 'MX_COMPAT_32' '-D' '_GNU_SOURCE' '-D' 'MATLAB_MEX_FILE' '-I' '/usr/local/MATLAB/R2015a/extern/include' '-I' '/usr/local/MATLAB/R2015a/simulink/include' '-ansi' '-fexceptions' '-fPIC' '-fno-omit-frame-pointer' '-pthread' '-O0' '-D' 'NDEBUG' '-o' '/tmp/mex_198714460457975_3922/test4.o' '-v' '-mtune=generic' '-march=x86-64'
/usr/lib/gcc/x86_64-linux-gnu/4.7/cc1 -quiet -v -I /usr/local/MATLAB/R2015a/extern/include -I /usr/local/MATLAB/R2015a/simulink/include -imultilib . -imultiarch x86_64-linux-gnu -D_REENTRANT -D MX_COMPAT_32 -D _GNU_SOURCE -D MATLAB_MEX_FILE -D NDEBUG path/to/test4.c -quiet -dumpbase test4.c -mtune=generic -march=x86-64 -auxbase-strip /tmp/mex_198714460457975_3922/test4.o -O0 -ansi -version -fexceptions -fPIC -fno-omit-frame-pointer -fstack-protector -o /tmp/ccxDOA5f.s
GNU C (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) version 4.7.3 (x86_64-linux-gnu)
compiled by GNU C version 4.7.3, GMP version 5.0.2, MPFR version 3.1.0-p3, MPC version 0.9
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
/usr/local/MATLAB/R2015a/extern/include
/usr/local/MATLAB/R2015a/simulink/include
/usr/lib/gcc/x86_64-linux-gnu/4.7/include
/usr/local/include
/usr/lib/gcc/x86_64-linux-gnu/4.7/include-fixed
/usr/include/x86_64-linux-gnu
/usr/include
End of search list.
GNU C (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) version 4.7.3 (x86_64-linux-gnu)
compiled by GNU C version 4.7.3, GMP version 5.0.2, MPFR version 3.1.0-p3, MPC version 0.9
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: c119948b394d79ea05b6b3986ab084cf
EDIT: a follow-on: I followed chqrlie's advice and tcc compiled my function in <5 seconds (I had to remove the -ansi flag only and turn "gcc" to "tcc"), which is pretty remarkable, really. I can only imagine the complexity of GCC.
When trying to then mex it, however, there is one other command mex typically needs. The second command is typically:
/usr/bin/gcc -pthread -Wl,--no-undefined -Wl,-rpath-link,/usr/local/MATLAB/R2015a/bin/glnxa64 -shared -O -Wl,--version-script,"/usr/local/MATLAB/R2015a/extern/lib/glnxa64/mexFunction.map" /tmp/mex_61853296369424_4031/test4.o -L"/usr/local/MATLAB/R2015a/bin/glnxa64" -lmx -lmex -lmat -lm -lstdc++ -o test4.mexa64
I cannot run this with tcc as some of these flags are not compatible. If I try to run this second compilation step with GCC, I get:
/usr/bin/ld: test4.o: relocation R_X86_64_PC32 against undefined symbol `mxGetPr' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status
EDIT: The solution appears to be clang. tcc can compile the file, but the arguments in the second step in mexing are incompatible with tcc's argument options. Clang is very fast and produces a nice, small, optimized file.
Nearly the entire file is one expression, the assignment to double f[24] = ...
. That's going to generate a gigantic abstract syntax tree tree. I'd be surprised if anything but a specialized compiler could handle that efficiently.
The 20 megabyte file itself may be fine, but the one giant expression may be what is causing the issue. Try as a preliminary step, splitting the line into double f[24] = {0}
and then 24 assignments of f[0] = ...; f[1] = ...
and see what happens. Worst case, you can split the 24 assignments into 24 functions, each in its own .c
file, and compile them separately. This won't reduce the size of the AST, it will just reorganize it, but GCC is probably more optimized at handling many statements which together add up to a lot of code, vs. one huge expression.
The ultimate approach would be to generate the code in a more optimized manner. If I search for s4*s5*s6
, for example, I get 77,783 hits. These s[4-6]
variables don't change. You should generate a temporary variable, double _tmp1 = s4*s5*s6;
and then use that instead of repeating the expression. You've just eliminated 311,132 nodes from your abstract syntax tree (assuming s4*s5*s6
is 5 nodes and _tmp1
is one node). That's that much less processing GCC has to do. This should also generate faster code (you won't repeat the same multiplication 77,783 times).
If you do this in a smart way in a recursive manner (e.g. s4*s5*s6
--> _tmp1
, (c4*c6+s4*s5*s6)
--> (c4*c6+_tmp1)
--> _tmp2
, c5*s6*(c4*c6+s4*s5*s6)
--> c5*s6*_tmp2
-> _tmp3
, etc...), you can probably eliminate most of the size of the generated code.
Upon testing, I found that the Clang compiler seems to have less problems compiling large files. Although Clang consumed almost a gigabyte of memory during compilation, it successfully turned OP's source code form into a 70 kB object file. This works for all optimization levels I tested.
gcc was also able to compile this file quickly and without consuming too much memory if optimization is turned on. This bug in gcc comes from the large expression in OPs code which places a huge burden on the register allocator. With optimizations turned on, the compiler performs an optimization called common subexpression elimination which is able to remove a lot of redundancy from OPs code, reducing both compilation time and object file size to manageable values.
Here are some tests with the testcase from the aforementioned bug report:
$ time gcc5 -O3 -c -o testcase.gcc5-O3.o testcase.c
real 0m39,30s
user 0m37,85s
sys 0m1,42s
$ time gcc5 -O0 -c -o testcase.gcc5-O0.o testcase.c
real 23m33,34s
user 23m27,07s
sys 0m5,92s
$ time tcc -c -o testcase.tcc.o testcase.c
real 0m2,60s
user 0m2,42s
sys 0m0,17s
$ time clang -O3 -c -o testcase.clang-O3.o testcase.c
real 0m13,71s
user 0m12,55s
sys 0m1,16s
$ time clang -O0 -c -o testcase.clang-O0.o testcase.c
real 0m17,63s
user 0m16,14s
sys 0m1,49s
$ time clang -Os -c -o testcase.clang-Os.o testcase.c
real 0m14,88s
user 0m13,73s
sys 0m1,11s
$ time clang -Oz -c -o testcase.clang-Oz.o testcase.c
real 0m13,56s
user 0m12,45s
sys 0m1,09
This is the resulting object file size:
text data bss dec hex filename
39101286 0 0 39101286 254a366 testcase.clang-O0.o
72161 0 0 72161 119e1 testcase.clang-O3.o
72087 0 0 72087 11997 testcase.clang-Os.o
72087 0 0 72087 11997 testcase.clang-Oz.o
38683240 0 0 38683240 24e4268 testcase.gcc5-O0.o
87500 0 0 87500 155cc testcase.gcc5-O3.o
78239 0 0 78239 1319f testcase.gcc5-Os.o
69210504 3170616 0 72381120 45072c0 testcase.tcc.o
Try Fabrice Bellard's tiny C compiler tcc
from http://tinycc.org:
chqrlie$ time tcc -c test4.c
real 0m1.336s
user 0m1.248s
sys 0m0.084s
chqrlie$ size test4.o
text data bss dec hex filename
38953877 3170632 0 42124509 282c4dd test4.o
Yes, that's 1.336 seconds on a pretty basic PC!
Of course I cannot test the resulting executable, but the object file should be linkable with the rest of your program and libraries.
For this test, I used a dummy version of file mex.h
:
typedef struct mxArray mxArray;
double *mxGetPr(const mxArray*);
enum { mxREAL = 0 };
mxArray *mxCreateDoubleMatrix(int nx, int ny, int type);
gcc
still has not finished compiling...
EDIT: gcc
managed to hog my Linux box so badly I cannot connect to it anymore:(
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With