An assembler takes assembly code as input and produces machine code as output. So does it mean that an assembler also has to do lexical analysis and syntax analysis on the code?
For example, it needs some way to distinguish MOV as an instruction from MOVXYZ as a label.
Take, for example, the following piece of 8086-compatible code:
MOV MOVXYZ,013h
MOV BX,023h
ADD BX,MOVXYZ
If it does require another round of lexical analysis and syntax analysis, then why have assembly as an intermediate step in compilation at all?
Edit:
The assembler gets the assembly code as input:
MOV AX,MOVXYZ
ADD AX,BX
It is essentially a file of characters. My question is: if not by lexical analysis, how does it distinguish "MOV" from "MOVS"?
An assembler takes assembly code as input and produces machine code as output. So does it mean that an assembler also has to do lexical analysis and syntax analysis on the code?
Yes. Assembly language can be thought of as a programming language just like any other, albeit a very low-level one.
For example, it needs some way to distinguish MOV as an instruction from MOVXYZ as a label.
Indeed.
If it does require another round of lexical analysis and syntax analysis, then why have assembly as an intermediate step in compilation at all?
As you say, it does require analysis, and in fact many compilers do not use assembly as an intermediate step but emit binary code directly into some kind of object format, which is later fed to the linker stage.
As a separate question: if three-address code is generated as the intermediate form, then its optimisation (done by the compiler, from three-address code to optimised three-address code) would also require lexical analysis.
Correct, if the three-address code were actually emitted as text. In reality it is typically emitted into internal tables in binary form and is therefore effectively already parsed and analyzed.
So does it mean that an assembler also has to do lexical analysis and syntax analysis on the code?
Only in a very limited way. It has to do it in the sense that it has to extract the opcodes, arguments and so on, which means turning a sequence of characters into an internal representation that it can actually work with. But unlike "real parsers", the parsers in assemblers often work with plain old string processing instead of finite state machines and things like that. You'll often see things like reading a line, splitting it, and interpreting the first part as the opcode. That's not how proper lexical analysis works, but it is effectively extracting tokens.
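As a rough illustration of that "read a line, split it" style, here is a minimal Python sketch. The `MNEMONICS` table, the `tokenize_line` name and the label/comment syntax (`label:` and `;`) are assumptions made for the example, not taken from any particular assembler. The point is that the mnemonic is compared as a whole token, so `MOVXYZ` never matches `MOV`.

```python
# Hypothetical, tiny mnemonic table; a real assembler knows many more.
MNEMONICS = {"MOV", "MOVS", "ADD"}

def tokenize_line(line):
    """Split one assembly line into (label, mnemonic, operands)."""
    line = line.split(";", 1)[0].strip()      # drop a trailing comment
    if not line:
        return None                           # blank or comment-only line
    label = None
    if ":" in line:                           # assumed "label:" syntax
        label, line = (part.strip() for part in line.split(":", 1))
    if not line:
        return (label, None, [])              # a label on its own line
    parts = line.split(None, 1)               # first word vs. the rest
    mnemonic = parts[0].upper()
    if mnemonic not in MNEMONICS:             # whole-token comparison
        raise ValueError(f"unknown mnemonic: {parts[0]}")
    operands = []
    if len(parts) > 1:
        operands = [op.strip() for op in parts[1].split(",")]
    return (label, mnemonic, operands)

print(tokenize_line("MOV AX,MOVXYZ"))
```

Because the first word is looked up as a complete string, `MOV AX,MOVXYZ` yields the mnemonic `MOV` with `MOVXYZ` as a plain operand, and a line starting with `MOVXYZ` would be rejected as an unknown mnemonic rather than mistaken for `MOV`.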
This was removed from the question, but yes, assemblers may do some optimization as well, though nothing of the sort you'd expect a compiler to do. Sometimes there are several ways to translate a mnemonic into an actual instruction, and then it may matter which one the assembler chooses, and that choice may be non-trivial. An example is conditional branch sizes on x86: there is the 2-byte 7x ofs8 form, which has a limited range, and the 6-byte 0F 8x ofs32 form.
In order to find the addresses of instructions and labels (and thus determine which branch form it can or has to use), the assembler has to know the size of every instruction, but the sizes of the branches themselves depend on those addresses. A common way to resolve this circularity is to assume the small size first, then iteratively change any branch that doesn't reach its target to the bigger variant (this may then push other branches out of range, and so on).
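The iterative approach can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how any specific assembler is implemented: the program is a list of hypothetical `("insn", size)`, `("label", name)` and `("jcc", target)` items, and every branch is an x86 conditional jump that is either 2 or 6 bytes long.

```python
SHORT, LONG = 2, 6   # sizes of the x86 Jcc rel8 and Jcc rel32 forms

def relax(items):
    """Return the chosen size of each item (labels take 0 bytes)."""
    sizes = []
    for kind, arg in items:
        if kind == "jcc":
            sizes.append(SHORT)          # optimistic first guess
        elif kind == "insn":
            sizes.append(arg)            # fixed, already-known size
        else:
            sizes.append(0)              # label
    changed = True
    while changed:
        changed = False
        # Recompute all addresses from the current size guesses.
        addr, addrs, labels = 0, [], {}
        for (kind, arg), size in zip(items, sizes):
            addrs.append(addr)
            if kind == "label":
                labels[arg] = addr
            addr += size
        # Grow every short branch whose target is out of rel8 range.
        for i, (kind, arg) in enumerate(items):
            if kind == "jcc" and sizes[i] == SHORT:
                # The offset is relative to the end of the branch.
                disp = labels[arg] - (addrs[i] + SHORT)
                if not -128 <= disp <= 127:
                    sizes[i] = LONG
                    changed = True
    return sizes

program = [("jcc", "a"), ("jcc", "b"), ("insn", 124),
           ("label", "a"), ("insn", 4), ("label", "b")]
print(relax(program))
```

In this example the second branch is out of range on the first pass; widening it pushes label `a` four bytes further away, which in turn forces the first branch to be widened on the next pass. The loop terminates because branches only ever grow.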
Also, some assembly languages have "pseudo instructions", which are written as a single mnemonic but may be assembled to two or more actual instructions. The choice of instructions may depend on the operands and so on (in that case it's effectively optimizing for a specific case). Or, more commonly, it could just be a pre-determined macro. MIPS and ARM both have pseudo instructions of the last type.
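To make the operand-dependent case concrete, here is a hedged Python sketch of how a MIPS assembler might expand the `li` (load immediate) pseudo-instruction. The function name `expand_li` and the textual output format are made up for the example, and real assemblers may pick slightly different sequences; the idea is only that the expansion depends on how large the constant is.

```python
def expand_li(rd, imm):
    """Expand the MIPS pseudo-instruction `li rd, imm` into real ones."""
    if -0x8000 <= imm <= 0x7FFF:
        # Fits in a sign-extended 16-bit immediate: one instruction.
        return [f"addiu {rd}, $zero, {imm}"]
    if 0 <= imm <= 0xFFFF:
        # Fits in a zero-extended 16-bit immediate: one instruction.
        return [f"ori {rd}, $zero, {imm:#x}"]
    imm &= 0xFFFFFFFF                    # view the value as 32 bits
    hi, lo = imm >> 16, imm & 0xFFFF
    out = [f"lui {rd}, {hi:#x}"]         # load the upper half
    if lo:                               # lower half only if non-zero
        out.append(f"ori {rd}, {rd}, {lo:#x}")
    return out

print(expand_li("$t0", 5))
print(expand_li("$t0", 0x12345678))
```

A small constant needs a single instruction, while a full 32-bit constant needs a `lui`/`ori` pair, so the same mnemonic can assemble to different sizes.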
That was the weird side of assembling; most of what assemblers do is just take an instruction and encode it. For example, if you write add eax, edx, the assembler extracts the tokens add, eax and edx, recognizes that this is an add instruction with operands that look like r32, and then looks up in a big table (or giant switch or decision tree) how to encode it. It turns out there are two encodings that fit that pattern, 01 /r and 03 /r. So you could get 01 D0 or 03 C2, depending on a choice that the author of the assembler made. If it's assembling 16-bit code, it would also emit an operand-size override prefix.
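The two encodings can be reproduced with a small Python sketch. The function names and the `use_store_form` flag are invented for illustration; only the opcode bytes, the register numbers and the ModRM layout (mod, reg, r/m) come from the x86 encoding itself.

```python
# Standard x86 32-bit register numbers.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def modrm(mod, reg, rm):
    """Pack the three ModRM fields into one byte."""
    return (mod << 6) | (reg << 3) | rm

def encode_add_r32_r32(dst, src, use_store_form=True):
    """Encode `add dst, src` for two 32-bit registers (32-bit mode)."""
    d, s = REG32[dst], REG32[src]
    if use_store_form:
        # 01 /r is ADD r/m32, r32: destination in r/m, source in reg.
        return bytes([0x01, modrm(0b11, s, d)])
    # 03 /r is ADD r32, r/m32: destination in reg, source in r/m.
    return bytes([0x03, modrm(0b11, d, s)])

print(encode_add_r32_r32("eax", "edx").hex(" "))
print(encode_add_r32_r32("eax", "edx", use_store_form=False).hex(" "))
```

Both byte sequences perform the same addition; which one is emitted is purely the assembler's choice.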