Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

tutorials on first pass and second pass of assembler

Are there good tutorials around that explain about the first and second pass of assembler along with their algorithms ? I searched a lot about them but haven't got satisfying results. Please link the tutorials if any.

like image 683
program-o-steve Avatar asked Sep 17 '25 06:09

program-o-steve


2 Answers

A good place to start is David Solomon's book, Assemblers and Loaders. It's an older book, but the information is still relevant.

You can download a PDF of the book.

like image 58
Jim Mischel Avatar answered Sep 19 '25 19:09

Jim Mischel


Dont know of any tutorials, there really isnt much to it.

one:
  inc r0
  cmp r0,0
  jnz one
  call fun
  add r0,7
  jmp more_fun
fun:
  mov r1,r0
  ret
more_fun:

The assembler/software, like a human is going to read the source file from top to bottom, byte 0 in the file to the end. there are no hard and fast rules as to what you complete in each pass, and it is not necessarily a pass "on the file" but a pass "on the data".

First pass: As you read each line you parse it. You are building some sort of data structure that has the instructions in file order. When you come across a label like one:, you keep track of what instruction that was in front of or perhaps you have a marker between instructions however you choose to implement it. When you come across an instruction that uses a label you have two choices, you can right now go look for that label, and if it is a backwards looking label then you should have seen it already like the jnz one instruction. IF you have thus far been keeping track of the number and size (if variable word length) instructions you can choose to encode this instruction now if it is a relative instruction, if the instruction set uses absolute you might have to just leave a placeholder anyway.

Now the call fun and jump more_fun instructions pose a problem, when you get to these instructions you cannot resolve them at this time, you dont know if these labels are local to this file or are in another file, so you cannot encode this instruction on the first pass, you have to save it for later, and this is the reason for the second pass.

The second pass is likely to be a pass across your data structures and not actually on the file, and this is heavily implementation specific. For example you might have a one dimensional array of structures and everything is in there. You may choose to make many passes on that data for example, start one index through the array looking for unresolved labels. When you find an unresolved label, send a second index through the array looking for a label definition. If you dont find it then, application specific, does your assembler create objects to be linked later or does it create a binary does it have to have everything resolved in this one assembly to binary step? If object then you assume this is external, unless application specific, your assembler requires external labels to be defined as external. So whether or not the missing label is an error is application specific. if it is not an error then, application specific, you should encode for the longest/farthest type of branch leaving the address or distance details for the linker to fill in.

For the labels you have found you now have a rough idea on how far. Now, depending on the instruction set and/or features of your assembler, you need to make several more passes on the data. You need to start encoding the instructions, assuming you have at least one flavor of relative distance call or branch instruction, you have to decide on the first encoding pass whether to hope for the, what i assume, is a shorter/smaller instruction for the relative distance branch or assume the larger one. You cant really determine if the smaller one will reach until you get one or a few encoding passes across the instructions.

top:
  ...
  jmp down
  ...
  jnz top
  ...
down:

As you encode the jmp down, you might choose optimistically to encode it as a smaller (number of bytes/words if variable word length) relative branch leaving the distance to be determined. When you get to the jnz top, lets say it is exactly to the byte just close enough to top to encode using a relative branch. On the second pass though you have to go back and finish the jmp down you find that it wont reach, you need more bytes/words to encode it as a long branch. Now the jnz top has to become a far branch as well (causing down to move again). You have to keep passing over the instructions, computing their distance far/short until you make pass with no changes. Be careful not to get caught in an infinite loop, where one pass you get to shorten an instruction, but that causes another to lengthen, and on the next pass the lengthen one causes the other to lengthen but the second to shorten and this repeats forever.

We could go back to the top of this and in your first pass you might build more than one or several data structures, maybe as you go you build a list of found labels, and a list of missing labels. And the second pass you look through the list of missing and see if they are in the found then resolve them that way. Or maybe on the first pass, and some might argue this is a single pass assembler, when you find a label, before continuing through the file you look back to see if anyone was looking for that label (or if that label had already been defined to declare an error) I would call this a multi pass assembler because it still passes through the data many times.

And now lets make it much worse. Look at the arm instruction set as an example and any other fixed length instruction set. Your relative branches are usually encoded in one instruction, thus fixed length instruction set. A far branch normally involves a load pc from the data found at this address, meaning you really need two items the instruction, then somewhere within the relative reach of that instruction a data word containing the absolute address of where to branch. You can choose to force the user to create these, but with the ARM assemblers for example they can and will do this for you, the simplest example is:

ldr r0,=0x12345678
...
b somewhere

That syntax means load r0 with the value 0x12345678, which does not fit in an arm instruction. What the assembler does with that syntax is it tries to find a dead spot in the code within reach of that instruction where it can place the data value, then it encodes that instruction as a load from pc relative address. For example after an unconditional branch is a good place to hide data. sometimes you have to use directives like .pool to encourage or remind the assembler good places to stick this data. r0 is not the program counter r15 is and you could use r15 there to connect this to the branching discussion above.

Take a look at the assembler I created for this project http://github.com/dwelch67/lsasim, a fixed length instruction set, but I force the user to allocate the word and load from it, I dont allow the shortcut the arm assemblers tend to allow.

I hope this helps explain things. The bottom line is that you cannot resolve lables in one linear pass through the data, you have to go back and connect the dots to the forward referenced labels. And I argue you have to do many passes anyway to resolve all of the long/short encodings (unless the instruction set/syntax forces the user to explicitly specify an absolute vs relative branch, and some do rjmp vs jmp or rjmp vs ljmp, rcall vs call, etc). Making one pass on the "file" sure, not a problem. If you allow include type directives some tools will create a temporary file where it pulls all the includes in creating a single file which has no includes in it, and then the tool makes one pass on this (this is how gcc manages includes for example, save intermediate files sometime and see what files are produced)(if you report line numbers with warnings/errors then you have to manage the temp file lines vs the original file name and line.).

like image 43
old_timer Avatar answered Sep 19 '25 19:09

old_timer