How to fix GCC compilation error when compiling >2 GB of code?

Tags: c++, math, gcc, compiler-errors, code-size

I have a huge number of functions totaling around 2.8 GB of object code (unfortunately there's no way around it, scientific computing ...)

When I try to link them, I get the (expected) relocation truncated to fit: R_X86_64_32S errors, which I hoped to circumvent by specifying the compiler flag -mcmodel=medium. All additionally linked libraries that I have control of are compiled with the -fpic flag.

Still, the error persists, and I assume that some libraries I link to are not compiled with PIC.

Here's the error:

    /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
    (.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini' defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
    /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
    (.text+0x19): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_init' defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
    /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
    (.text+0x20): undefined reference to `main'
    /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crti.o: In function `call_gmon_start':
    (.text+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol `__gmon_start__'
    /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtbegin.o: In function `__do_global_dtors_aux':
    crtstuff.c:(.text+0xb): relocation truncated to fit: R_X86_64_PC32 against `.bss'
    crtstuff.c:(.text+0x13): relocation truncated to fit: R_X86_64_32 against symbol `__DTOR_END__' defined in .dtors section in /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtend.o
    crtstuff.c:(.text+0x19): relocation truncated to fit: R_X86_64_32S against `.dtors'
    crtstuff.c:(.text+0x28): relocation truncated to fit: R_X86_64_PC32 against `.bss'
    crtstuff.c:(.text+0x38): relocation truncated to fit: R_X86_64_PC32 against `.bss'
    crtstuff.c:(.text+0x3f): relocation truncated to fit: R_X86_64_32S against `.dtors'
    crtstuff.c:(.text+0x46): relocation truncated to fit: R_X86_64_PC32 against `.bss'
    crtstuff.c:(.text+0x51): additional relocation overflows omitted from the output
    collect2: ld returned 1 exit status
    make: *** [testsme] Error 1

And system libraries I link against:

-lgfortran -lm -lrt -lpthread

Any clues where to look for the problem?

EDIT:

First of all, thank you for the discussion...

To clarify a bit, I have hundreds of functions (each approx 1 MB in size in separate object files) like this:

    double func1(std::tr1::unordered_map<int, double> & csc,
                 std::vector<EvaluationNode::Ptr> & ti,
                 ProcessVars & s)
    {
        double sum, prefactor, expr;

        prefactor = +s.ds8*s.ds10*ti[0]->value();
        expr =       ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
               1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -
               27/10.*s.x14*s.x15*csc[49304] + 12/5.*s.x14*s.x15*csc[49305] -
               3/10.*s.x14*s.x15*csc[49306] - 4/5.*s.x14*s.x15*csc[49307] +
               21/10.*s.x14*s.x15*csc[49308] + 1/10.*s.x14*s.x15*csc[49309] -
               s.x14*s.x15*csc[51370] - 9/10.*s.x14*s.x15*csc[51371] -
               1/10.*s.x14*s.x15*csc[51372] + 3/5.*s.x14*s.x15*csc[51373] +
               27/10.*s.x14*s.x15*csc[51374] - 12/5.*s.x14*s.x15*csc[51375] +
               3/10.*s.x14*s.x15*csc[51376] + 4/5.*s.x14*s.x15*csc[51377] -
               21/10.*s.x14*s.x15*csc[51378] - 1/10.*s.x14*s.x15*csc[51379] -
               2*s.x14*s.x15*csc[55100] - 9/5.*s.x14*s.x15*csc[55101] -
               1/5.*s.x14*s.x15*csc[55102] + 6/5.*s.x14*s.x15*csc[55103] +
               27/5.*s.x14*s.x15*csc[55104] - 24/5.*s.x14*s.x15*csc[55105] +
               3/5.*s.x14*s.x15*csc[55106] + 8/5.*s.x14*s.x15*csc[55107] -
               21/5.*s.x14*s.x15*csc[55108] - 1/5.*s.x14*s.x15*csc[55109] -
               2*s.x14*s.x15*csc[55170] - 9/5.*s.x14*s.x15*csc[55171] -
               1/5.*s.x14*s.x15*csc[55172] + 6/5.*s.x14*s.x15*csc[55173] +
               27/5.*s.x14*s.x15*csc[55174] - 24/5.*s.x14*s.x15*csc[55175] +
               // ...
               ;

            sum += prefactor*expr;
        // ...
        return sum;
    }

The object s is relatively small and keeps the needed constants x14, x15, ..., ds0, ..., etc., while ti just returns a double from an external library. As you can see, csc[] is a precomputed map of values, which is also evaluated in separate object files (again hundreds, of about 1 MB each) of the following form:

    void cscs132(std::tr1::unordered_map<int,double> & csc, ProcessVars & s)
    {
        {
        double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
               32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
               32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
               32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -
               32*s.x12pow2*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
               32*s.x12pow2*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
               32*s.x12pow2*s.x25*s.x35*s.x45*s.mWpowinv2 +
               32*s.x12pow2*s.x34*s.mbpow4*s.mWpowinv2 +
               32*s.x12pow2*s.x34*s.x35*s.mbpow2*s.mWpowinv2 +
               32*s.x12pow2*s.x34*s.x45*s.mbpow2*s.mWpowinv2 +
               32*s.x12pow2*s.x35*s.mbpow4*s.mWpowinv2 +
               32*s.x12pow2*s.x35pow2*s.mbpow2*s.mWpowinv2 +
               32*s.x12pow2*s.x35pow2*s.x45*s.mWpowinv2 +
               64*s.x12pow2*s.x35*s.x45*s.mbpow2*s.mWpowinv2 +
               32*s.x12pow2*s.x35*s.x45pow2*s.mWpowinv2 -
               64*s.x12*s.p1p3*s.x15*s.mbpow4*s.mWpowinv2 +
               64*s.x12*s.p1p3*s.x15pow2*s.mbpow2*s.mWpowinv2 +
               96*s.x12*s.p1p3*s.x15*s.x25*s.mbpow2*s.mWpowinv2 -
               64*s.x12*s.p1p3*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
               64*s.x12*s.p1p3*s.x15*s.x45*s.mbpow2*s.mWpowinv2 -
               32*s.x12*s.p1p3*s.x25*s.mbpow4*s.mWpowinv2 +
               32*s.x12*s.p1p3*s.x25pow2*s.mbpow2*s.mWpowinv2 -
               32*s.x12*s.p1p3*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
               32*s.x12*s.p1p3*s.x25*s.x45*s.mbpow2*s.mWpowinv2 -
               32*s.x12*s.p1p3*s.x45*s.mbpow2 +
               64*s.x12*s.x14*s.x15pow2*s.x35*s.mWpowinv2 +
               96*s.x12*s.x14*s.x15*s.x25*s.x35*s.mWpowinv2 +
               32*s.x12*s.x14*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
               32*s.x12*s.x14*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
               64*s.x12*s.x14*s.x15*s.x35pow2*s.mWpowinv2 -
               32*s.x12*s.x14*s.x15*s.x35*s.x45*s.mWpowinv2 +
               32*s.x12*s.x14*s.x25pow2*s.x35*s.mWpowinv2 +
               32*s.x12*s.x14*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
               32*s.x12*s.x14*s.x25*s.x35pow2*s.mWpowinv2 -
               // ...

           csc.insert(cscMap::value_type(192953, csc19295));
        }

        {
           double csc19296 =      // ... ;

           csc.insert(cscMap::value_type(192956, csc19296));
        }

        // ...
    }

That's about it. The final step then just consists of calling all those func[i] and summing up the results.

Concerning the fact that this is a rather special and unusual case: Yes, it is. This is what people have to cope with when trying to do high precision computations for particle physics.

EDIT2:

I should also add that x12, x13, etc. are not really constants. They are set to specific values, all those functions are run and the result returned, and then a new set of x12, x13, etc. is chosen to produce the next value. And this has to be done 10⁵ to 10⁶ times...

EDIT3:

Thank you for the suggestions and the discussion so far... I'll try to roll the loops up during code generation somehow. I'm not sure how to do this exactly, to be honest, but this is the best bet.

BTW, I didn't try to hide behind "this is scientific computing -- no way to optimize".
It's just that the basis for this code is something that comes out of a "black box" that I have no real access to and, moreover, the whole thing worked great with simple examples, and I mainly feel overwhelmed by what happens in a real-world application...

EDIT4:

So, I have managed to reduce the code size of the csc definitions by about one fourth by simplifying expressions in a computer algebra system (Mathematica). I now also see a way to reduce it by another order of magnitude or so by applying some other tricks before generating the code (which would bring this part down to about 100 MB), and I hope this idea works.

Now related to your answers:

I'm trying to roll the loops back up again in the funcs, where a CAS won't help much, but I already have some ideas. For instance, I can sort the expressions by variables like x12, x13, ..., parse the cscs with Python, and generate tables that relate them to each other. Then I can at least generate these parts as loops. As this seems to be the best solution so far, I mark this as the best answer.

However, I'd like to also give credit to VJo. GCC 4.6 indeed works much better, produces smaller code and is faster. Using the large model works with the code as-is. So technically this is the correct answer, but changing the whole concept is a much better approach.

Thank you all for your suggestions and help. If anyone is interested, I'm going to post the final outcome as soon as I am ready.

REMARKS:

Just some remarks on some other answers: The code I'm trying to run does not originate in an expansion of simple functions/algorithms and stupid unnecessary unrolling. What actually happens is that the stuff we start with are pretty complicated mathematical objects, and bringing them to a numerically computable form generates these expressions. The problem lies actually in the underlying physical theory. Complexity of intermediate expressions scales factorially, which is well known, but when combining all of this stuff to something physically measurable -- an observable -- it just boils down to only a handful of very small functions that form the basis of the expressions. (There is definitely something "wrong" in this respect with the general and only available ansatz, which is called "perturbation theory".) We try to bring this ansatz to another level, which is not feasible analytically anymore and where the basis of needed functions is not known. So we try to brute-force it like this. Not the best way, but hopefully one that helps with our understanding of the physics at hand in the end...

LAST EDIT:

Thanks to all your suggestions, I've managed to reduce the code size considerably, using Mathematica and a modification of the code generator for the funcs somewhat along the lines of the top answer :)

I have simplified the csc functions with Mathematica, bringing them down to 92 MB. This is the irreducible part. The first attempts took forever, but after some optimizations this now runs through in about 10 minutes on a single CPU.

The effect on the funcs was dramatic: The whole code size for them is down to approximately 9 MB, so the code now totals in the 100 MB range. Now it makes sense to turn optimizations on and the execution is quite fast.

Again, thank you all for your suggestions, I've learned a lot.

578

asked Jun 09 '11 17:06

bbtrb

1 Answer

So, you already have a program that produces this text:

    prefactor = +s.ds8*s.ds10*ti[0]->value();
    expr = ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
           1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -...

and

    double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
           32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -...

right?

If all your functions have a similar "format" (multiply n numbers m times and add the results, or something similar), then I think you can do this:

1. Change the generator program to output offsets instead of strings (i.e., instead of the string "s.ds0" it will produce offsetof(ProcessVars, ds0)).
2. Create an array of such offsets.
3. Write an evaluator that accepts the array above and the base addresses of the structure pointers and produces a result.

The array+evaluator will represent the same logic as one of your functions, but only the evaluator will be code. The array is "data" and can be either generated at runtime or saved on disk and read in chunks or with a memory-mapped file.

For your particular example in func1, imagine how you would rewrite the function via an evaluator if you had access to the base addresses of s and csc, and also a vector-like representation of the constants and of the offsets you need to add to the base addresses to get to x14, ds8 and csc[51370].

You need to create a new form of "data" that will describe how to process the actual data you pass to your huge number of functions.

174

answered Oct 01 '22 12:10

Andrei
