
How Do C++ Compilers Merge Identical String Literals

How does the compiler (MS Visual C++ 2010) combine identical string literals that appear in different .cpp source files? For example, suppose the string literal "hello world\n" appears in both src1.cpp and src2.cpp. The compiled exe file will contain only one "hello world\n" string literal, probably in the constant/read-only section. Is this task done by the linker?

What I hope to achieve is this: I have some modules written in assembly that are used by C++ modules, and these assembly modules contain many long string literal definitions. I know these string literals are identical to some string literals in the C++ source. If I link the object code generated from my assembly with the object code generated by the compiler, will the linker merge these string literals and remove the redundant copies, as it does when all modules are written in C++?

JavaMan asked Jun 08 '11 16:06



2 Answers

(Note the following applies only to MSVC)

My first answer was misleading since I thought that the literal merging was magic done by the linker (and so that the /GF flag would only be needed by the linker).

However, that was a mistake. It turns out the linker has little special involvement in merging string literals - what happens is that when the /GF option is given to the compiler, it puts string literals in a "COMDAT" section of the object file with an object name that's based on the contents of the string literal. So the /GF flag is needed for the compile step, not for the link step.

When you use the /GF option, the compiler places each string literal in the object file in a separate section as a COMDAT object. The various COMDAT objects with the same name will be folded by the linker (I'm not exactly sure about the semantics of COMDAT, or what the linker might do if objects with the same name have different data). So a C file that contains

char* another_string = "this is a string";

will have something like the following in the object file:

SECTION HEADER #3
  .rdata name
       0 physical address
       0 virtual address
      11 size of raw data
     147 file pointer to raw data (00000147 to 00000157)
       0 file pointer to relocation table
       0 file pointer to line numbers
       0 number of relocations
       0 number of line numbers
40301040 flags
         Initialized Data
         COMDAT; sym= "`string'" (??_C@_0BB@LFDAHJNG@this?5is?5a?5string?$AA@)
         4 byte align
         Read Only

RAW DATA #3
  00000000: 74 68 69 73 20 69 73 20 61 20 73 74 72 69 6E 67  this is a string
  00000010: 00      

with the relocation table wiring up the another_string variable name to the literal data.
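One way to observe the folding from the program side is to compare the addresses of two identical literals. The sketch below is only an illustration: the two helper functions are hypothetical stand-ins for code that would live in src1.cpp and src2.cpp, and the C++ standard leaves it unspecified whether identical literals share storage, so the address comparison may give either answer depending on compiler and options (with MSVC, whether /GF is in effect).

```cpp
#include <cstring>

// Hypothetical helpers standing in for code from src1.cpp and src2.cpp.
const char* literal_from_src1() { return "this is a string"; }
const char* literal_from_src2() { return "this is a string"; }

// True when both literals hold the same characters (always the case here).
bool same_contents() {
    return std::strcmp(literal_from_src1(), literal_from_src2()) == 0;
}

// True only when the toolchain folded the two literals into one object.
// The standard leaves this unspecified, so a program must not rely on it.
bool same_address() {
    return literal_from_src1() == literal_from_src2();
}
```

Calling same_address() from a test program built once with and once without string pooling is a quick way to check what a particular toolchain does.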

Note that the name of the string literal object is clearly based on the contents of the literal string, but with some sort of mangling. The mangling scheme has been partially documented on Wikipedia (see "String constants").
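To make the relationship between the symbol name and the literal more concrete, here is a rough sketch that decodes the two escapes visible in the dump above: `?5` for a space and `?$AA` for the terminating zero byte. It also reads a length field such as `BB` as hexadecimal digits written with the letters A to P, which gives 17 and matches the 0x11 bytes of raw data. This is inferred from that one example and is not a full implementation of the mangling scheme; the function names are made up, other escapes are left untouched, and the hash field (`LFDAHJNG`) is not handled at all.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert one letter in 'A'..'P' to the nibble 0..15 it stands for.
unsigned nibble_from_letter(char c) {
    assert(c >= 'A' && c <= 'P');
    return static_cast<unsigned>(c - 'A');
}

// Decode the character part of a mangled literal name.
// Only "?5" (space) and "?$XY" (one byte as two letters) are handled.
std::string decode_literal_chars(const std::string& mangled) {
    std::string out;
    for (std::size_t i = 0; i < mangled.size(); ++i) {
        if (mangled[i] != '?') {
            out.push_back(mangled[i]);           // plain character
        } else if (i + 1 < mangled.size() && mangled[i + 1] == '5') {
            out.push_back(' ');                  // "?5" is a space
            i += 1;
        } else if (i + 3 < mangled.size() && mangled[i + 1] == '$') {
            unsigned hi = nibble_from_letter(mangled[i + 2]);
            unsigned lo = nibble_from_letter(mangled[i + 3]);
            out.push_back(static_cast<char>(hi * 16 + lo));
            i += 3;
        } else {
            out.push_back('?');                  // unknown escape: keep as-is
        }
    }
    return out;
}

// Decode a length field such as "BB", reading each letter as a hex nibble.
unsigned decode_length(const std::string& letters) {
    unsigned value = 0;
    for (char c : letters) value = value * 16 + nibble_from_letter(c);
    return value;
}
```

For the name in the dump, decode_literal_chars("this?5is?5a?5string?$AA") yields the 16 characters of the literal followed by a zero byte, and decode_length("BB") yields 17.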

Anyway, if you want literals in an assembly file to be treated in the same manner, you'd need to arrange for the literals to be placed in the object file in the same manner. I honestly don't know what (if any) mechanism the assembler might have for that. Placing an object in a "COMDAT" section is probably pretty easy - getting the name of the object to be based on the string contents (and mangled in the appropriate manner) is another story.

Unless there's some assembly directive/keyword that specifically supports this scenario, I think you might be out of luck. There certainly might be one, but I'm sufficiently rusty with ml.exe to have no idea, and a quick look at the skimpy MSDN docs for ml.exe didn't have anything jump out.

However, if you're willing to put the string literals in a C file and refer to them in your assembly code via externs, it should work. That's essentially what Mark Ransom advocates in his comments to the question.
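A minimal sketch of that extern approach, written as C++ with C linkage so the symbol name is not C++-mangled. The name shared_greeting and the helper function are hypothetical. The assembly module would declare the same symbol as external; on 32-bit x86, C names normally carry a leading underscore, so it would likely appear there as _shared_greeting.

```cpp
#include <cstddef>
#include <cstring>

// Single shared definition of the string. extern "C" gives the symbol
// external linkage and an unmangled name that assembly code can reference.
extern "C" const char shared_greeting[] = "hello world\n";

// C++ code uses the named array instead of repeating the literal,
// so only this one copy of the text needs to exist in the image.
std::size_t shared_greeting_length() {
    return std::strlen(shared_greeting);
}
```

The design point is that both sides refer to one named object, so nothing depends on the compiler and linker folding separate copies. Any C++ code that still writes the literal "hello world\n" directly creates its own object, which may or may not be pooled.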

Michael Burr answered Oct 29 '22 04:10


Yes, the process of merging the resources is done by the linker.

If your resources in your compiled assembly code are properly tagged as resources, the linker will be able to merge them with compiled C code.

Paul Sonier answered Oct 29 '22 02:10