
Convert Unicode code points to UTF-8 and UTF-32

Tags: c, utf-8, utf

I can't think of a way to remove the leading zeros. My goal is to loop over the code points and create the UTF-8 and UTF-32 version of each number.

For example, with UTF-8, wouldn't I have to remove the leading zeros? Does anyone have a solution for how to pull this off? Basically what I am asking is: does someone have an easy solution to convert Unicode code points to UTF-8?

    for (unsigned int i = 0x0; i < 0xffff; i++) {
        printf("%#x \n", i);
        // convert to UTF-8 here
    }

So here is an example of what I am trying to accomplish for each i.

  • For example: the Unicode code point U+0760 (base 16) would be encoded in UTF-8 as
    • in binary: 1101 1101 1010 0000
    • in hex: DD A0

Basically, for every i, I am trying to convert it to its hex equivalent in UTF-8.

The problem I am running into is that the process for converting Unicode to UTF-8 seems to involve removing the leading 0s from the bit pattern, and I am not really sure how to do that dynamically.
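For reference, here is how that U+0760 example works out at the bit level. This is a minimal hand-worked sketch of the two-byte case only (code points 0x80 through 0x7FF, inside any function with <stdio.h> included); the answer below generalizes it to all four encoded lengths:

    unsigned int cp = 0x0760;               /* 0000 0111 0110 0000 in binary     */
    unsigned char b0 = 0xC0 | (cp >> 6);    /* 110 + top 5 payload bits  -> 0xDD */
    unsigned char b1 = 0x80 | (cp & 0x3F);  /* 10  + low 6 payload bits  -> 0xA0 */
    printf("%02X %02X\n", b0, b1);          /* prints: DD A0                     */

In other words, "removing the leading zeros" is really just shifting and masking: the significant bits of the code point are split into groups from the bottom up, and each group gets a fixed prefix.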

asked Dec 05 '22 by Joe Caraccio

1 Answer

As the Wikipedia UTF-8 page describes, each Unicode code point (0 through 0x10FFFF) is encoded in UTF-8 as one to four bytes.

Here is a simple example function, edited from one of my earlier posts. I've now removed the U suffixes from the integer constants, too. (Their intent was to remind the human programmer that the constants are deliberately unsigned -- negative code points are not considered at all -- and that the function assumes an unsigned int code. The compiler does not care either way, and the practice seems to confuse more readers than it helps, so I've stopped including such reminders.)

static size_t code_to_utf8(unsigned char *const buffer, const unsigned int code)
{
    if (code <= 0x7F) {
        buffer[0] = code;
        return 1;
    }
    if (code <= 0x7FF) {
        buffer[0] = 0xC0 | (code >> 6);            /* 110xxxxx */
        buffer[1] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 2;
    }
    if (code <= 0xFFFF) {
        buffer[0] = 0xE0 | (code >> 12);           /* 1110xxxx */
        buffer[1] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
        buffer[2] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 3;
    }
    if (code <= 0x10FFFF) {
        buffer[0] = 0xF0 | (code >> 18);           /* 11110xxx */
        buffer[1] = 0x80 | ((code >> 12) & 0x3F);  /* 10xxxxxx */
        buffer[2] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
        buffer[3] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 4;
    }
    return 0;
}

You supply it with an unsigned char array of at least four elements, plus the Unicode code point. The function returns how many chars it needed to encode the code point in UTF-8 and assigned in the array. It returns 0 (not encoded) for codes above 0x10FFFF, but it does not otherwise check that the Unicode code point is valid. I.e., it is a simple encoder, and all it knows about Unicode is that the code points are from 0 to 0x10FFFF, inclusive. It knows nothing about surrogate pairs, for example.

Note that because the code point is explicitly an unsigned integer, negative arguments will be converted to unsigned according to C rules.
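For illustration, here is a minimal usage sketch (my own example, assuming the function above is defined in the same file). It encodes the U+0760 code point from the question and prints the resulting bytes in hexadecimal:

    #include <stdio.h>

    int main(void)
    {
        unsigned char buffer[4];
        const size_t len = code_to_utf8(buffer, 0x0760);  /* returns 2 for this code point */

        for (size_t i = 0; i < len; i++)
            printf("%02X ", buffer[i]);                   /* prints: DD A0 */
        putchar('\n');

        return 0;
    }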

You need to write a function that prints out the 8 least significant bits of each unsigned char (the C standard does allow larger char sizes, but UTF-8 only uses 8-bit bytes). Then, use the above function to convert a Unicode code point (0 to 0x10FFFF, inclusive) to its UTF-8 representation, and call your bit function for each unsigned char in the array, in increasing order, for the count of unsigned chars the conversion function returned for that code point.
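One possible shape for that bit-printing step, as a sketch (the helper names print_bits and print_utf8_bits here are my own choices, and <stdio.h> is assumed to be included):

    /* Print the 8 least significant bits of one byte, most significant bit first. */
    static void print_bits(const unsigned char byte)
    {
        for (int bit = 7; bit >= 0; bit--)
            putchar(((byte >> bit) & 1) ? '1' : '0');
    }

    /* Print the UTF-8 encoding of one code point as groups of 8 bits,
       e.g. 0x0760 prints as "11011101 10100000". */
    static void print_utf8_bits(const unsigned int code)
    {
        unsigned char buffer[4];
        const size_t len = code_to_utf8(buffer, code);

        for (size_t i = 0; i < len; i++) {
            if (i > 0)
                putchar(' ');
            print_bits(buffer[i]);
        }
        putchar('\n');
    }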

answered Dec 30 '22 by Nominal Animal