
Stream of Unicode Code Points from Bytes in C?

Tags: c, unicode, icu

I'm writing an HTML parser in C, and am looking to correctly follow the W3C guidelines on parser implementation. One of the key points is that the parser operates on a stream of Unicode Code Points rather than bytes, which makes sense.

Basically, then, given a buffer of known character encoding (I will either be given an explicit input encoding, or will use the HTML5 prescan algorithm to make a good guess), what's the best way in C — ideally cross-platform, but sticking to UNIX is fine — to iterate over an equivalent sequence of Unicode Code Points?

Is alloc'ing a few reasonably-sized buffers and using iconv the way to go? Should I be looking at ICU? The macros like U16_NEXT seem to be well-suited to my task, but the ICU documentation is incredibly long-winded, and it's a little hard to see exactly how to glue things together.

Matt Patenaude asked Dec 20 '12 00:12


2 Answers

ICU is a good choice. I used it with C++ and liked it a lot. I am quite sure you get similar thought-through APIs in C as well.

Not exactly the same, but somewhat related: this tutorial explains how to perform streaming/incremental transliteration. The difficulty in that case is that the "cursor" may sometimes sit inside a code point, i.e. a chunk can end partway through a multi-unit sequence.
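The same chunk-boundary problem shows up with any incremental decoder. As a rough sketch using POSIX iconv (the other option raised in the question) rather than ICU: when a chunk ends in the middle of a multibyte sequence, iconv stops before it and reports EINVAL, so the caller can carry the leftover bytes into the next chunk. The helper name decode_chunk is hypothetical, and the sketch assumes the local iconv implementation accepts "UTF-32LE" as a target encoding name.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one chunk of input into code points.
 * cd must come from iconv_open("UTF-32LE", <source encoding>).
 * Returns the number of code points written to out, or (size_t)-1 if the
 * input contains an invalid sequence. *consumed is set to the number of
 * input bytes used; any bytes after that are an incomplete sequence (or did
 * not fit in out) and should be prepended to the next chunk. */
static size_t decode_chunk(iconv_t cd, const char *in, size_t in_len,
                           uint32_t *out, size_t out_cap, size_t *consumed)
{
    assert(out != NULL && consumed != NULL);

    char *src = (char *)in;   /* iconv's prototype is not const-correct */
    size_t src_left = in_len;
    char *dst = (char *)out;
    size_t dst_left = out_cap * sizeof *out;

    errno = 0;
    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    *consumed = in_len - src_left;

    /* EINVAL (truncated sequence at the end) and E2BIG (out is full) are
     * expected while streaming; only EILSEQ is treated as fatal here. */
    if (rc == (size_t)-1 && errno == EILSEQ)
        return (size_t)-1;

    /* iconv wrote little-endian bytes; assemble them explicitly so the
     * result does not depend on the host byte order. */
    size_t n = out_cap - dst_left / sizeof *out;
    const unsigned char *raw = (const unsigned char *)out;
    for (size_t i = 0; i < n; i++) {
        const unsigned char *b = raw + 4 * i;
        out[i] = (uint32_t)b[0] | (uint32_t)b[1] << 8 |
                 (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
    }
    return n;
}
```

A caller would open the descriptor once with iconv_open("UTF-32LE", encoding), call decode_chunk for each buffer read from the input, move the unconsumed tail to the front of the next buffer, and finish with iconv_close. An HTML parser would probably substitute U+FFFD and skip a byte on EILSEQ instead of failing, which this sketch does not do.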

towi answered Nov 13 '22 09:11


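The following function decodes one code point from a UTF-16 string and returns the number of code units consumed, i.e. how far to advance the string pointer (how much was "chewed"). Note that it is C++ rather than C, since code is passed by reference. xs_utf16 is an unsigned short; int32 and uint32 are assumed to be typedefs for 32-bit signed and unsigned integers. More info: http://sree.kotay.com/2006/12/unicode-is-pain-in.html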

enum
{
    xs_UTF_Max          = 0x0010FFFFUL,
    xs_UTF_Replace      = 0x0000FFFDUL,
    xs_UTF16_HalfBase   = 0x00010000UL,
    xs_UTF16_HighStart  = 0x0000D800UL,
    xs_UTF16_HighEnd    = 0x0000DBFFUL,
    xs_UTF16_LowStart   = 0x0000DC00UL,
    xs_UTF16_LowEnd     = 0x0000DFFFUL,
    xs_UTF16_MaxUCS2    = 0x0000FFFFUL,
    xs_UTF16_HalfMask   = 0x000003FFUL,
    xs_UTF16_HalfShift  = 10
};


int32 xs_UTF16Decode (uint32 &code, const xs_utf16* str, int32 len, bool strict)
{
          if (str==0||len==0)          {code=0; return 0;}

          uint32 c1 = str[0];

          //note: many implementations test from HighStart to HighEnd,
          //                 this may be a partial code point, and is incorrect(?)
          //                 trivial checking should exclude the WHOLE surrogate range
          if (c1<xs_UTF16_HighStart || c1>xs_UTF16_LowEnd)          {code=c1; return 1;} //not a surrogate: the unit is the code point
                             //really an error if we're starting in the low range

          //surrogate pair
          if (len<=1 || str[1]==0)                                  {code=xs_UTF_Replace; return strict ? 0 : 1;} //error
          uint32 c2 = str[1];
          code = ((c1-xs_UTF16_HighStart)<<xs_UTF16_HalfShift) + (c2-xs_UTF16_LowStart) + xs_UTF16_HalfBase;

          if (strict==false)                                        return 2;

          //check for errors
          if (c1>=xs_UTF16_LowStart && c1<=xs_UTF16_LowEnd)         {code=xs_UTF_Replace; return 0;} //error
          if (c2<xs_UTF16_LowStart  || c2>xs_UTF16_LowEnd)          {code=xs_UTF_Replace; return 0;} //error
          if (code>xs_UTF_Max)                                      {code=xs_UTF_Replace; return 0;} //error

          //success
          return 2;
}
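
Since the question asks for C, here is a minimal plain-C sketch of the same idea, with a small loop that turns a UTF-16 buffer into a sequence of code points. The names utf16_next and utf16_to_code_points are hypothetical. Unlike the function above, this version has no strict flag: it is an assumption of the sketch that a lone surrogate should always become U+FFFD and consume one unit, so that iteration never stalls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UTF_REPLACEMENT 0xFFFDu

/* Decode one code point from s (len code units available).
 * Stores it in *cp and returns the number of units consumed:
 * 0 only when there is no input, otherwise 1 or 2. */
static size_t utf16_next(const uint16_t *s, size_t len, uint32_t *cp)
{
    assert(cp != NULL);
    if (s == NULL || len == 0) { *cp = 0; return 0; }

    uint32_t c1 = s[0];
    if (c1 < 0xD800u || c1 > 0xDFFFu) {      /* not a surrogate */
        *cp = c1;
        return 1;
    }
    if (c1 <= 0xDBFFu && len > 1) {          /* high surrogate with a follower */
        uint32_t c2 = s[1];
        if (c2 >= 0xDC00u && c2 <= 0xDFFFu) {
            *cp = 0x10000u + ((c1 - 0xD800u) << 10) + (c2 - 0xDC00u);
            return 2;
        }
    }
    *cp = UTF_REPLACEMENT;                   /* lone or misplaced surrogate */
    return 1;
}

/* Decode up to cap code points from s into out; returns how many were written. */
static size_t utf16_to_code_points(const uint16_t *s, size_t len,
                                   uint32_t *out, size_t cap)
{
    size_t i = 0, n = 0;
    while (i < len && n < cap) {
        i += utf16_next(s + i, len - i, &out[n]);
        n++;
    }
    return n;
}
```

For example, the units 0x0041, 0xD83D, 0xDE00, 0xDC00 decode to U+0041, U+1F600, and U+FFFD for the trailing lone low surrogate.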
sree answered Nov 13 '22 08:11