I'm storing a UTF-8 character in eax
and later on, in processing, I need to know how many bytes make up the character.
I've narrowed it down to the following approach, which minimizes shifts and masks, and wanted to know: am I missing some neat trick somewhere?
Option 1: Brute Force
mov r11, 4 ; Maximum bytes
bt eax, 31 ; Test top bit of the 4th byte
jc .exit
dec r11 ; Let's try 3
bt eax, 23 ; Test top bit of the 3rd byte
jc .exit
dec r11 ; Let's try 2
bt eax, 15 ; Test top bit of the 2nd byte
jc .exit
dec r11 ; It's straight-up ASCII (1 byte)
.exit:
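For reference, the same ladder can be sketched in C (the helper name `utf8_len_bruteforce` is hypothetical). It assumes the code units are packed little-endian in a 32-bit word with the first byte in the low byte and unused high bytes zeroed, which is what the `bt` tests above rely on:

```c
#include <stdint.h>

/* Every occupied byte of a multi-byte UTF-8 character has its top
   bit set (lead byte 11xxxxxx, continuation bytes 10xxxxxx), so
   testing bits 31, 23 and 15 from the top down finds the count. */
static int utf8_len_bruteforce(uint32_t ch)
{
    if (ch & (1u << 31)) return 4;  /* 4th byte present */
    if (ch & (1u << 23)) return 3;  /* 3rd byte present */
    if (ch & (1u << 15)) return 2;  /* 2nd byte present */
    return 1;                       /* plain ASCII      */
}
```

For example, U+00E9 encodes as 0xC3 0xA9, which packed little-endian is 0xA9C3: bit 15 is set, so the helper reports 2 bytes.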
Note: eax register wrong as pointed out by, well, everyone.

If you can assume a correct encoding of the character, you can simply check where the highest zero is in the first code unit (thanks to the self-synchronizing property of UTF-8).
The wrinkle is that for code points of a single code unit the highest zero is bit 7, while for code points of n code units the highest zero is bit 7 - n (note the "discontinuity").
Assuming the first code unit is in al:
not al          ;Transform the highest 0 into the highest 1
movzx eax, al   ;bsr has no 8-bit form, so widen to 32 bits
bsr eax, eax    ;Find the index (from bit 0) of the highest set bit
xor eax, 7      ;Perform 7 - index
                ;This gives 0 for single code unit code points
mov edx, 1      ;mov does not affect the flags
cmovz eax, edx  ;Change the 0 back to 1
Note that bsr is not defined for an input of 0, but that can only happen for an invalid leading code unit (of value 11111111b). You can detect the invalid 0xff code unit with a jz <error handler> right after the bsr instruction.
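The whole trick can be sketched in C, using GCC/Clang's `__builtin_clz` in place of `bsr` (the helper name `utf8_len_lead` is hypothetical):

```c
#include <stdint.h>

/* Sketch of the highest-zero trick, assuming a correctly encoded
   leading code unit. Inverting the byte turns the highest 0 into
   the highest 1; its index, subtracted from 7, is the length. */
static int utf8_len_lead(uint8_t lead)
{
    uint8_t inverted = (uint8_t)~lead;        /* highest 0 becomes highest 1 */
    if (inverted == 0)
        return -1;                            /* invalid 0xff code unit      */
    int index = 31 - __builtin_clz(inverted); /* like bsr on a 32-bit reg    */
    int len = index ^ 7;                      /* the XOR trick for 7 - index */
    return len == 0 ? 1 : len;                /* single code unit: 0 -> 1    */
}
```

For example, the lead byte 0xE2 (1110xxxx) inverts to 0x1D, whose highest set bit is bit 4, giving 7 - 4 = 3 bytes.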
Thanks to @CodyGray for pointing out a bug on the original version.
Thanks to @PeterCordes for pointing out the XOR trick to do 7 - AL.