How to index through a string in assembly

Tags:

intel

Given the variable:

var1    db  "abcdefg", NULL

How would I perform a loop to navigate each letter? In C++ you would do something like var[x] inside the loop, then increment x each time. Any ideas?

Jenke asked Jun 14 '17 03:06


1 Answer

In C and C++, strings are NUL terminated. This means that an ASCII NUL character (0) is added to the end of the string so that code can tell where the string ends. The strlen function walks through the string, starting from the beginning, and keeps looping until it encounters this NUL character. When it finds the NUL, it knows that's the end of the string, and it returns the number of characters from the beginning to the NUL as the string's length.
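As a rough illustration of that walk, here is a minimal C sketch. The name my_strlen is made up for this example; it is not the library's actual strlen implementation, just the same idea written out:

```c
#include <stddef.h>

/* Hypothetical helper (not the C library's strlen): counts the
   characters before the terminating NUL. */
size_t my_strlen(const char *s)
{
    const char *p = s;          /* start at the beginning of the string */

    while (*p != '\0') {        /* keep going until we hit the NUL */
        ++p;
    }
    return (size_t)(p - s);     /* distance from the start to the NUL */
}
```

Called as my_strlen("abcdefg"), this stops at the NUL that follows 'g'.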

String literals (the things in double-quotation marks) are automatically NUL-terminated by a C/C++ compiler, so that:

"abcdefg"

is equivalent to the following array:

{'a', 'b', 'c', 'd', 'e', 'f', 'g', 0}
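If you want to convince yourself of that equivalence, a small C check along these lines should do it. The function name literal_matches_array is an assumption made up for this sketch:

```c
#include <string.h>

/* Hypothetical check: returns 1 if the string literal and the explicit
   array occupy the same number of bytes and hold the same contents. */
int literal_matches_array(void)
{
    const char literal[]        = "abcdefg";
    const char explicit_array[] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 0};

    /* Both arrays are 8 bytes: 7 letters plus the terminating NUL. */
    return sizeof literal == sizeof explicit_array
        && memcmp(literal, explicit_array, sizeof literal) == 0;
}
```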

I mention this because Peter Rader suggested it in his answer, and you didn't really understand what he was talking about. However, it seems that you already know this, as you appended a NUL character to your string in the assembly declaration:

var1    db  "abcdefg", NULL

Now, generally, we don't use the identifier NULL for this, especially not in C, where NULL is a null pointer constant. We just use the literal 0, so that definition would be:

var1    db  "abcdefg", 0

but your code probably works, assuming that NULL is somewhere defined as 0.

So you've got the setup all correct. Now all you need to do is write your loop:

    mov  edx, OFFSET var1    ; get starting address of string

NextChar:                    ; (not "Loop": LOOP is an instruction mnemonic)
    mov  al, BYTE PTR [edx]  ; get next character
    inc  edx                 ; increment pointer
    test al, al              ; test value in AL and set flags
    jz   Finished            ; AL == 0, so exit the loop

    ; Otherwise, AL != 0, so we fell through.
    ; Here, you can do something with the character in AL.
    ; ...

    jmp  NextChar            ; keep looping

Finished:
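Since you mentioned C++, here is roughly the same loop as a C sketch, keeping the same load / increment / test order. The function count_char and its "count the matches" body are just placeholders I made up to stand in for "do something with the character":

```c
#include <stddef.h>

/* Hypothetical example: counts how often `target` occurs in a
   NUL-terminated string, mirroring the assembly loop step by step. */
size_t count_char(const char *s, char target)
{
    const char *p = s;          /* mov  edx, OFFSET var1 */
    size_t count = 0;

    for (;;) {
        char c = *p;            /* mov  al, BYTE PTR [edx] */
        ++p;                    /* inc  edx */
        if (c == '\0') {        /* test al, al / jz Finished */
            break;
        }
        if (c == target) {      /* "do something with the character" */
            ++count;
        }
    }
    return count;
}
```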

You say that you're familiar with the CMP instruction. In the code above, I used TEST rather than CMP. You could have equivalently written:

cmp  al, 0

but

test al, al

is the form you'll usually see. For most registers it is a byte smaller because it needs no immediate operand (AL happens to have a short CMP encoding, so here the two are the same size), and I'm just in the habit of writing it that way in the special case that I'm comparing a register's value to 0. Compilers will generate this code, too, so it's good to be familiar with it.


Bonus chatter: An alternative way of representing a string is to store its length (in characters) along with the string itself. This is what the Pascal language traditionally did. This way, you don't need the special NUL sentinel character at the end of the string. Rather, the declaration would look like this:

var1    db  7, "abcdefg"

where the first byte of every string is its length. This has various advantages over the C style, namely that you don't have to iterate through the entire string to determine its length. The primary disadvantage, of course, is that a string's length is limited to 255 characters, since that's all that will fit into a BYTE.

Anyway, with the length known in advance, you're no longer checking for a NUL character, you're just iterating the same number of times as the characters in the string:

    mov  edx, OFFSET var1    ; get starting address of string
    mov  cl, BYTE PTR [edx]  ; get length of string

NextChar:                    ; (not "Loop": LOOP is an instruction mnemonic)
    inc  edx                 ; increment pointer
    mov  al, BYTE PTR [edx]  ; get next character

    ; Do something with the character in AL
    ; (without clobbering CL or EDX).
    ; ...

    dec  cl                  ; decrement length
    jnz  NextChar            ; CL != 0, so keep looping

Finished:

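(In the code above, I've assumed that all strings are a minimum of 1 character in length. This is probably a safe assumption, and avoids the need to do a length check above the loop. If a string could be empty, you would need that check, because otherwise the 8-bit counter wraps around and the loop runs past the end of the string.)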
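For comparison, a C sketch of iterating a length-prefixed string might look like the following. The name pascal_count_char is hypothetical, and counting matches is only a stand-in for whatever you want to do with each character. Unlike the assembly above, this version tests the length before the first read, so an empty string is handled too:

```c
#include <stddef.h>

/* Hypothetical example: ps[0] holds the length, and the characters
   follow it. No NUL terminator is needed or looked for. */
size_t pascal_count_char(const unsigned char *ps, unsigned char target)
{
    unsigned char remaining = ps[0];     /* get length of string */
    const unsigned char *p = ps + 1;     /* first character */
    size_t count = 0;

    while (remaining != 0) {             /* checked up front: empty is OK */
        if (*p == target) {
            ++count;
        }
        ++p;                             /* advance pointer */
        --remaining;                     /* one fewer character to go */
    }
    return count;
}
```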

Alternatively, you could do the array-indexing that you mentioned, but you have to be a bit careful if you want to iterate forwards through the string:

    mov   edx, OFFSET var1        ; get starting address of string
    movzx ecx, BYTE PTR [edx]     ; get length of string
    lea   edx, [edx+ecx+1]        ; increment pointer by 1 + number of chars
    neg   ecx                     ; negate the length counter
NextChar:                         ; (not "Loop": LOOP is an instruction mnemonic)
    mov   al, BYTE PTR [edx+ecx]  ; get next character

    ; Do something with the character in AL.
    ; ...

    inc   ecx
    jnz   NextChar                ; ECX != 0, so keep looping

Basically, we set EDX to point just past the end of the string, we set the counter (ECX) to the negative of the length of the string, and then we read characters by indexing [EDX+ECX]. Since ECX is negative, that address is EDX minus the number of characters still left to read, and ECX counts up toward 0 as we move forwards through the string.
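The same negative-index trick can be sketched in C, if that helps to see what the addressing is doing. The function name pascal_count_char_neg is made up, and again the counting is only a placeholder for real work:

```c
#include <stddef.h>

/* Hypothetical example: walks a length-prefixed string forwards by
   indexing from one past its end with a negative index. */
size_t pascal_count_char_neg(const unsigned char *ps, unsigned char target)
{
    const unsigned char *end = ps + 1 + ps[0];  /* lea edx, [edx+ecx+1] */
    ptrdiff_t i = -(ptrdiff_t)ps[0];            /* neg ecx */
    size_t count = 0;

    while (i != 0) {             /* stop once the index reaches 0 */
        if (end[i] == target) {  /* mov al, BYTE PTR [edx+ecx] */
            ++count;
        }
        ++i;                     /* inc ecx */
    }
    return count;
}
```

The first read is end[-length], which is the first character, and the last read is end[-1], which is the final one.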

There is almost certainly a better (more clever) way of doing this than I've managed to think up here, but you should get the idea.

Cody Gray answered Oct 06 '22 08:10