LLVM and compiler nomenclature

Tags:

assembly

compilation

llvm

clang

llvm-ir

I am looking into the LLVM system and I have read through the Getting Started documentation. However, some of the nomenclature (and the wording in the clang example) is still a little confusing. The following terms and commands are all part of the compilation process, and I was wondering if someone might be able to explain them a little better for me:

clang -S vs. clang -c (I know what -c does, but how do the results differ?)
LLVM Bitcode vs. LLVM IR (what is the difference?)
.ll files vs. .bc files (what are they, how do they differ?)
LLVM assembly code vs. native assembly code (is there a difference?)

At a higher level, I understand the overall compilation process and can track my way through it fairly well. I just get stuck at points where, for example, I am expecting to see "IR" but instead see "bitcode" or "LLVM assembly", which leads me to think I don't understand them nearly as well as I should!

794

asked Jan 01 '13 04:01

Ephemera

1 Answer

Clang usage

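In general, Clang accepts the same command-line options as GCC, and -c and -S mean the same thing in both. The -c option (compile and assemble, but do not link) produces an object file, while the -S option (compile only; do not assemble or link) produces an assembly file. When -emit-llvm is added, Clang writes LLVM IR instead of native code: -S then produces a textual .ll file and -c produces a binary .bc file.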
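
As a rough illustration of how the two results differ, here is a minimal sketch. The file name square.c and its contents are made up for this example, and the commands assume clang is on your PATH (the compile steps are simply skipped if it is not).

```shell
# Work in a scratch directory so nothing in the current directory is touched.
cd "$(mktemp -d)"

# Hypothetical one-function source file, used only for this illustration.
cat > square.c <<'EOF'
int square(int x) { return x * x; }
EOF

if command -v clang >/dev/null 2>&1; then
  # -S: stop after compilation and write native assembly (a text file).
  clang -S square.c -o square.s

  # -c: compile and assemble, writing a native object file (binary).
  clang -c square.c -o square.o

  ls -l square.s square.o
fi
```

square.s can be opened in any text editor, while square.o is binary and is what the linker consumes.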

LLVM terms regarding the Intermediate Representation

To quote from another answer of mine on this site:

LLVM IR is typically stored on disk in either text files with .ll extension or in binary files with .bc extension. Conversion between the two is trivial, and you can just use llvm-dis for bc -> ll and llvm-as for ll -> bc. The binary format is more memory-efficient, while the textual format is human-readable.

In addition, there are some commonly used aliases:

The binary format, stored in .bc files, is also called bitcode (though I've occasionally heard the term "bitcode" applied to the general IR as well)
The IR, particularly in its textual .ll form, is also called LLVM assembly or the LLVM assembly language

In any case, it all means the same thing, under potentially different representations.
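
The two on-disk forms can be produced and converted along these lines. This is only a sketch under a few assumptions: square.c is a made-up example file, -emit-llvm is the Clang flag that switches the output from native code to LLVM IR, and llvm-as / llvm-dis must come from an LLVM release compatible with your clang (on some systems they are installed under versioned names, in which case the last step is skipped).

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

# Hypothetical source file for the illustration.
cat > square.c <<'EOF'
int square(int x) { return x * x; }
EOF

if command -v clang >/dev/null 2>&1; then
  # Textual IR, also called LLVM assembly: human-readable .ll file.
  clang -S -emit-llvm square.c -o square.ll

  # Binary IR, also called bitcode: compact .bc file.
  clang -c -emit-llvm square.c -o square.bc

  # The textual form contains a 'define' line for the function.
  grep 'define' square.ll
fi

if command -v llvm-as >/dev/null 2>&1 && command -v llvm-dis >/dev/null 2>&1; then
  # ll -> bc
  llvm-as square.ll -o from_ll.bc
  # bc -> ll
  llvm-dis square.bc -o from_bc.ll
fi
```

Both files describe the same module; only the encoding differs.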

Native Assembly

Native assembly is what most people think of when they hear the term "assembly": a low-level language with an almost 1:1 mapping to native machine code. Unlike LLVM assembly, native assembly is very target-dependent (examples are x86 assembly, ARM assembly, etc.). Native assembly is turned into native binary by an assembler. LLVM includes one, though you can also use other assemblers (e.g. gas).

Native binary, the result of the assembling process, is of course the (only) language the computer really speaks; after linking, it can be loaded into memory and run directly on your hardware.
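
To see where LLVM assembly ends and native assembly begins, the path from source to a running program can be walked one stage at a time. This is a minimal sketch, assuming clang and a working linker are installed; answer.c and the intermediate file names are invented for this example.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

# Hypothetical program: exits with status 0 if square(6) is 36, else 1.
cat > answer.c <<'EOF'
static int square(int x) { return x * x; }
int main(void) { return square(6) == 36 ? 0 : 1; }
EOF

if command -v clang >/dev/null 2>&1; then
  # C source -> LLVM assembly (textual IR).
  clang -S -emit-llvm answer.c -o answer.ll

  # LLVM assembly -> native assembly for the host target.
  clang -S answer.ll -o answer.s

  # Native assembly -> native object file (Clang runs the assembler).
  clang -c answer.s -o answer.o

  # Object file -> linked executable, then run it.
  clang answer.o -o answer
  ./answer
  echo "exit status: $?"
fi
```

Comparing answer.ll with answer.s side by side shows the difference directly: the first uses LLVM's own instruction set and typed virtual registers, the second uses the instructions and registers of your actual CPU.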

198

answered Oct 14 '22 15:10

Oak
