I'm writing an HTML parser in C, and am looking to correctly follow the W3C guidelines on parser implementation. One of the key points is that the parser operates on a stream of Unicode Code Points rather than bytes, which makes sense.
Basically, then, given a buffer of known character encoding (I will either be given an explicit input encoding, or will use the HTML5 prescan algorithm to make a good guess), what's the best way in C — ideally cross-platform, but sticking to UNIX is fine — to iterate over an equivalent sequence of Unicode Code Points?
Is alloc'ing a few reasonably-sized buffers and using iconv the way to go? Should I be looking at ICU? The macros like U16_NEXT seem to be well-suited to my task, but the ICU documentation is incredibly long-winded, and it's a little hard to see exactly how to glue things together.
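ICU is a good choice. I have used it from C++ and liked it a lot, and ICU4C exposes a comparably well-designed C API: the converter functions in unicode/ucnv.h (ucnv_open, ucnv_toUnicode, ucnv_getNextUChar) cover the step of turning bytes in a known encoding into Unicode.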
Not quite the same, but somewhat related, is this tutorial explaining how to perform streaming/incremental transliteration (the difficulty there is that the "cursor" may sometimes sit inside a code point).
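To illustrate the "cursor inside a code point" problem with the iconv route the question mentions, here is a minimal sketch rather than a definitive implementation. The names cp_stream, cp_stream_open, cp_stream_feed, cp_sink, cp_buf and cp_collect are hypothetical; only iconv_open, iconv and iconv_close are real API. It assumes an iconv that accepts "UTF-32LE" as a target (glibc and GNU libiconv do) and that reports a sequence cut off by the end of a chunk with EINVAL, as POSIX describes.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback: receives one decoded code point at a time. */
typedef void (*cp_sink)(uint32_t cp, void *user);

/* Hypothetical streaming state: the iconv descriptor plus the bytes of an
   incomplete sequence left over from the previous chunk. */
typedef struct {
    iconv_t cd;
    char    pending[16];
    size_t  npending;
} cp_stream;

/* Example sink that collects code points into a fixed array. */
typedef struct { uint32_t cp[64]; size_t n; } cp_buf;
static void cp_collect(uint32_t cp, void *user)
{
    cp_buf *b = user;
    if (b->n < 64) b->cp[b->n++] = cp;
}

static int cp_stream_open(cp_stream *s, const char *encoding)
{
    s->npending = 0;
    s->cd = iconv_open("UTF-32LE", encoding); /* fixed byte order, no BOM */
    return s->cd == (iconv_t)-1 ? -1 : 0;
}

static void cp_stream_close(cp_stream *s) { iconv_close(s->cd); }

/* Feed one chunk of input. Returns 0 on success, -1 on invalid input.
   At end of input, a nonzero s->npending means the data was truncated. */
static int cp_stream_feed(cp_stream *s, const char *data, size_t len,
                          cp_sink sink, void *user)
{
    assert(s != NULL && sink != NULL);
    char in[256];
    while (len > 0) {
        /* Leftover bytes from the last chunk go first, then new input. */
        size_t n = s->npending;
        memcpy(in, s->pending, n);
        size_t take = sizeof in - n < len ? sizeof in - n : len;
        memcpy(in + n, data, take);
        data += take; len -= take; n += take;
        s->npending = 0;

        char *ip = in;
        size_t ileft = n;
        while (ileft > 0) {
            unsigned char out[256];
            char *op = (char *)out;
            size_t oleft = sizeof out;
            size_t rc = iconv(s->cd, &ip, &ileft, &op, &oleft);
            size_t produced = sizeof out - oleft;
            /* Output written before an error is still valid: emit it. */
            for (size_t i = 0; i + 4 <= produced; i += 4)
                sink((uint32_t)out[i] | (uint32_t)out[i + 1] << 8 |
                     (uint32_t)out[i + 2] << 16 | (uint32_t)out[i + 3] << 24,
                     user);
            if (rc != (size_t)-1 || errno == E2BIG)
                continue;                 /* done, or output was full */
            if (errno != EINVAL || ileft > sizeof s->pending)
                return -1;                /* EILSEQ or other failure */
            memcpy(s->pending, ip, ileft); /* sequence split by chunk end */
            s->npending = ileft;
            break;
        }
    }
    return 0;
}
```

Feeding the UTF-8 bytes of "hé" as the two chunks "h\xC3" and "\xA9" delivers U+0068 after the first call and U+00E9 after the second; the dangling 0xC3 is carried over in pending between the calls. On platforms whose iconv declares its input argument as const char **, the call needs a cast.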
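The following decodes one code point from a UTF-16 buffer and returns how many 16-bit units to advance the string by (how much was "chewed"). Note that xs_utf16 is an unsigned short, and that the function returns the code point through a reference parameter, so it is C++ rather than plain C. More info: http://sree.kotay.com/2006/12/unicode-is-pain-in.html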
enum
{
    xs_UTF_Max          = 0x0010FFFFUL,
    xs_UTF_Replace      = 0x0000FFFDUL,
    xs_UTF16_HalfBase   = 0x00010000UL,
    xs_UTF16_HighStart  = 0x0000D800UL,
    xs_UTF16_HighEnd    = 0x0000DBFFUL,
    xs_UTF16_LowStart   = 0x0000DC00UL,
    xs_UTF16_LowEnd     = 0x0000DFFFUL,
    xs_UTF16_MaxUCS2    = 0x0000FFFFUL,
    xs_UTF16_HalfMask   = 0x000003FFUL,
    xs_UTF16_HalfShift  = 10
};
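// Assumed typedefs (not shown in the original snippet); xs_utf16 is an unsigned short, as noted above.
typedef int            int32;
typedef unsigned int   uint32;
typedef unsigned short xs_utf16;

// Decodes one code point from str into code and returns the number of UTF-16 units consumed.
// Returns 0 for empty input and, in strict mode, for malformed input (the caller must not loop on 0).
int32 xs_UTF16Decode (uint32 &code, const xs_utf16* str, int32 len, bool strict)
{
    if (str==0 || len==0)                               {code=0; return 0;}
    uint32 c1 = str[0];

    //note: many implementations test only from HighStart to HighEnd;
    //      a lone low surrogate is a partial code point, so that is incorrect(?)
    //      the trivial check should exclude the WHOLE surrogate range
    if (c1<xs_UTF16_HighStart || c1>xs_UTF16_LowEnd)    {code=c1; return 1;} //BMP: the unit is the code point

    //surrogate pair (starting in the low range is really an error; caught below in strict mode)
    if (len<=1 || str[1]==0)                            {code=xs_UTF_Replace; return strict ? 0 : 1;} //error: truncated pair
    uint32 c2 = str[1];
    code = ((c1-xs_UTF16_HighStart)<<xs_UTF16_HalfShift) + (c2-xs_UTF16_LowStart) + xs_UTF16_HalfBase;
    if (strict==false)                                  return 2; //no validation: a malformed pair yields a meaningless code

    //check for errors
    if (c1>=xs_UTF16_LowStart && c1<=xs_UTF16_LowEnd)   {code=xs_UTF_Replace; return 0;} //error: lone low surrogate first
    if (c2<xs_UTF16_LowStart  || c2>xs_UTF16_LowEnd)    {code=xs_UTF_Replace; return 0;} //error: second unit is not a low surrogate
    if (code>xs_UTF_Max)                                {code=xs_UTF_Replace; return 0;} //error: beyond U+10FFFF

    //success
    return 2;
}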
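Since the question asks for C and the decoder above takes its output by reference, here is a minimal plain-C sketch of the same idea together with the loop that drives it. This is a variant, not the author's code: utf16_next and utf16_to_code_points are hypothetical names, and unlike the strict mode above it always consumes at least one unit and substitutes U+FFFD for a lone or truncated surrogate, so the loop cannot stall.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from s[0..len) into *cp and
   return the number of 16-bit units consumed (0 only for empty input). */
static size_t utf16_next(const uint16_t *s, size_t len, uint32_t *cp)
{
    assert(cp != NULL);
    if (s == NULL || len == 0) { *cp = 0; return 0; }

    uint32_t c1 = s[0];
    if (c1 < 0xD800 || c1 > 0xDFFF) { *cp = c1; return 1; } /* BMP */

    /* A valid pair is a high surrogate followed by a low surrogate. */
    if (c1 <= 0xDBFF && len > 1) {
        uint32_t c2 = s[1];
        if (c2 >= 0xDC00 && c2 <= 0xDFFF) {
            *cp = ((c1 - 0xD800) << 10) + (c2 - 0xDC00) + 0x10000;
            return 2;
        }
    }

    *cp = 0xFFFD; /* lone or truncated surrogate: replace, skip one unit */
    return 1;
}

/* Walk a UTF-16 buffer, writing at most cap code points to out.
   Returns the number of code points written. */
static size_t utf16_to_code_points(const uint16_t *s, size_t len,
                                   uint32_t *out, size_t cap)
{
    size_t n = 0;
    while (len > 0 && n < cap) {
        size_t used = utf16_next(s, len, &out[n++]);
        s += used;
        len -= used;
    }
    return n;
}
```

For the units 0x0048, 0xD83D, 0xDE00 this produces the two code points U+0048 and U+1F600. Replacing malformed units instead of returning 0 trades error reporting for a simpler caller, which suits a tolerant consumer such as an HTML tokenizer.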