I'm storing a UTF-8 character in eax
and later on, in processing, I need to know how many bytes make up the character.
I've narrowed it down to the following approach, which minimizes shifts and masks, and wanted to know: am I missing some neat trick somewhere?
Option 1: Brute Force
mov r11, 4 ; Maximum bytes
bt eax, 31 ; Test top bit of the 4th byte
jc .exit
dec r11 ; Let's try 3
bt eax, 23 ; Test top bit of the 3rd byte
jc .exit
dec r11 ; Let's try 2
bt eax, 15 ; Test top bit of the 2nd byte
jc .exit
dec r11 ; It's straight-up ASCII (1 byte)
.exit:
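For reference, the same ladder can be sketched in C (the helper name `utf8_len_bruteforce` is hypothetical). It assumes the code units are packed little-endian in a 32-bit word with the first byte in the low byte and unused high bytes zeroed, which is what the `bt` tests above rely on:

```c
#include <stdint.h>

/* Every occupied byte of a multi-byte UTF-8 character has its top
   bit set (lead byte 11xxxxxx, continuation bytes 10xxxxxx), so
   testing bits 31, 23 and 15 from the top down finds the count. */
static int utf8_len_bruteforce(uint32_t ch)
{
    if (ch & (1u << 31)) return 4;  /* 4th byte present */
    if (ch & (1u << 23)) return 3;  /* 3rd byte present */
    if (ch & (1u << 15)) return 2;  /* 2nd byte present */
    return 1;                       /* plain ASCII      */
}
```

For example, U+00E9 encodes as 0xC3 0xA9, which packed little-endian is 0xA9C3: bit 15 is set, so the helper reports 2 bytes.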
Note: eax register wrong as pointed out by, well, everyone.

If you can assume a correct encoding of the character, you can simply check where the highest zero is in the first code unit (thanks to the self-synchronizing property of UTF-8).
The wrinkle is that for code points of a single code unit the highest zero is bit 7, while for code points of n code units the highest zero is bit 7 - n (note the "discontinuity").
Assuming the first code unit is in al:
not al          ;Transform the highest 0 into the highest 1
movzx eax, al   ;bsr has no 8-bit form, so widen to 32 bits
bsr eax, eax    ;Find the index (from bit 0) of the highest set bit
xor eax, 7      ;Perform 7 - index
                ;This gives 0 for single code unit code points
mov edx, 1      ;mov does not affect the flags
cmovz eax, edx  ;Change the 0 back to 1
Note that bsr is not defined for an input of 0, but that can only happen for an invalid leading code unit (of value 11111111b). You can detect the invalid 0xff code unit with a jz <error handler> right after the bsr instruction.
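The whole trick can be sketched in C, using GCC/Clang's `__builtin_clz` in place of `bsr` (the helper name `utf8_len_lead` is hypothetical):

```c
#include <stdint.h>

/* Sketch of the highest-zero trick, assuming a correctly encoded
   leading code unit. Inverting the byte turns the highest 0 into
   the highest 1; its index, subtracted from 7, is the length. */
static int utf8_len_lead(uint8_t lead)
{
    uint8_t inverted = (uint8_t)~lead;        /* highest 0 becomes highest 1 */
    if (inverted == 0)
        return -1;                            /* invalid 0xff code unit      */
    int index = 31 - __builtin_clz(inverted); /* like bsr on a 32-bit reg    */
    int len = index ^ 7;                      /* the XOR trick for 7 - index */
    return len == 0 ? 1 : len;                /* single code unit: 0 -> 1    */
}
```

For example, the lead byte 0xE2 (1110xxxx) inverts to 0x1D, whose highest set bit is bit 4, giving 7 - 4 = 3 bytes.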
Thanks to @CodyGray for pointing out a bug on the original version.
Thanks to @PeterCordes for pointing out the XOR trick to do 7 - AL.