
How to convert UTF-16 to ASCII

I'm writing a subroutine in MIPS assembly language to convert ASCII to UTF-16 and vice versa. However, I could not find a technique for doing the conversion.

Yunus Eren Güzel asked Mar 19 '11 21:03



2 Answers

Pseudocode, assuming that your bytes are octets and that no zero termination is required:

Conversion from ASCII to UTF-16

  1. Given an ASCII input string of length n (in bytes) stored sequentially in memory at address p.
  2. Allocate 2 × n bytes of memory; let the start address of that memory be q.
  3. While n is larger than zero:
    1. Check whether the byte at p is a valid ASCII character. If no parity bit (checksum) is used, the most significant bit has to be zero; otherwise it has to be the correct parity bit. Issue an error if the byte is not valid.
    2. Zero-extend the byte at p to the 16-bit word at q. How this is done depends on the instruction set; e.g., x86 has MOVZX, and on MIPS loading the byte with lbu (which zero-extends) and storing it with sh has the same effect. Pay attention to endianness: the byte order of the stored word has to match the UTF-16 variant you want (UTF-16LE or UTF-16BE).
    3. Increment p by 1.
    4. Increment q by 2.
    5. Decrement n by 1.

Lossless conversion from UTF-16 to ASCII

  1. Given a UTF-16 input string of length n (in code units) stored sequentially in memory at address p.
  2. Allocate n bytes of memory; let the start address of that memory be q.
  3. While n is larger than zero:
    1. Check whether the 16-bit word at p represents a valid ASCII character. The nine most significant bits have to be zero, otherwise the character is not representable in ASCII. Issue an error if the word is not valid.
    2. Move the least significant byte of the 16-bit word at p to the byte at q.
    3. If required, set the parity bit (checksum) in the most significant bit of the byte at q.
    4. Increment p by 2.
    5. Increment q by 1.
    6. Decrement n by 1.
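The reverse direction can be sketched the same way. Again this is only an illustration under assumptions: the name utf16_to_ascii and the 0 / -1 return codes are hypothetical, the input is in the host's byte order, the output buffer is already allocated, and no parity bit is added.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts n UTF-16 code units at p into n ASCII
 * bytes at q. Returns 0 on success, -1 if a code unit is not
 * representable in 7-bit ASCII. Assumes q has room for n bytes. */
int utf16_to_ascii(const uint16_t *p, size_t n, uint8_t *q)
{
    while (n > 0) {
        if (*p & 0xFF80)            /* any of the nine top bits set */
            return -1;
        *q = (uint8_t)(*p & 0x7F);  /* keep the least significant byte */
        p += 1;                     /* advance input by one 16-bit word */
        q += 1;                     /* advance output by one byte */
        n -= 1;
    }
    return 0;
}
```

The 0xFF80 mask is what makes the conversion lossless: anything outside U+0000 to U+007F, including surrogate code units, is rejected rather than truncated.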
Philipp answered Oct 12 '22 10:10



The term ASCII is not very specific.

ASCII (ISO 646, US variant) is a subset of Unicode, so 7-bit ASCII values are already Unicode code points (i.e. you just drop them into the bottom of a 16-bit value). If this is what you mean by ASCII, then for the other direction all you have to do is take the low 7 bits of the UTF-16 code unit, after checking that the upper 9 bits are zero.

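ISO-8859-1 (Latin-1) is just as simple, because its 256 byte values are the first 256 Unicode code points: zero-extend in one direction and take the low 8 bits in the other. If, on the other hand, you need some other 8-bit encoding (Windows-1252, say, or another ISO-8859 part), you'll need a conversion table. There is no formula that can be translated into simple instructions in assembly language.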
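A table lookup of the kind mentioned above might be sketched in C as follows. The choice of Windows-1252 as the example encoding, the function name cp1252_to_utf16, and the use of U+FFFD for the five bytes that Windows-1252 leaves undefined are all assumptions made for illustration, not something the answer prescribes.

```c
#include <stdint.h>

/* Windows-1252 bytes 0x80..0x9F; every other byte value equals its
 * Unicode code point. Undefined bytes are mapped to U+FFFD here,
 * which is an illustrative choice. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

/* Hypothetical helper: maps one Windows-1252 byte to a UTF-16 code unit. */
uint16_t cp1252_to_utf16(uint8_t b)
{
    if (b >= 0x80 && b <= 0x9F)
        return cp1252_high[b - 0x80];  /* table lookup for the odd range */
    return (uint16_t)b;                /* plain zero-extension elsewhere */
}
```

In assembly the same idea is an indexed halfword load from a table in the data segment, with the byte value (minus 0x80, times 2) as the offset.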

bmargulies answered Oct 12 '22 09:10
