I'm looking for a small C library to handle UTF-8 strings.
Specifically, I need to split strings on Unicode delimiters for use with stemming algorithms.
Related posts have suggested:
ICU http://www.icu-project.org/ (I found it too bulky for my purposes on embedded devices)
UTF8-CPP: http://utfcpp.sourceforge.net/ (Excellent, but C++ not C)
Has anyone found a small, platform-independent library for handling Unicode strings? (It doesn't need to do normalisation.)
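UTF-8 can represent every character in the Unicode code space (1,114,112 code points, U+0000 to U+10FFFF). Most C code that deals with strings on a byte-by-byte basis still works, since UTF-8 is backward compatible with 7-bit ASCII: ASCII characters keep their single-byte values, and those byte values never occur inside a multi-byte sequence. A character takes one to four bytes, and most text in common use needs fewer than four. Sort order is preserved: comparing two strings byte by byte gives the same result as comparing them by code point.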
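As a rough illustration of the byte-by-byte point, here is a minimal sketch in plain C. The helper names `utf8_count` and `find_ascii_delim` are made up for this example and do not come from any library, and the counting loop assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string
   by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes the input is valid UTF-8; no validation is done here. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: find an ASCII delimiter with plain strchr.
   This is safe on UTF-8 because bytes below 0x80 never appear inside
   a multi-byte sequence, so there are no false matches. */
static const char *find_ascii_delim(const char *s, char delim)
{
    return strchr(s, delim);
}
```

With these, `strlen` still reports the byte length, `utf8_count` reports the number of code points, and `strcmp` orders strings by code point because it compares bytes as unsigned values. None of this helps with non-ASCII delimiters, which need actual decoding.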
UTF-8 is a character encoding - a way of converting from sequences of bytes to sequences of characters and vice versa. It covers the whole of the Unicode character set.
C++ provides a wide-character type, wchar_t, which can store Unicode text. Its size and encoding are implementation defined: it is typically 32 bits (UTF-32) on Linux and other Unix-like systems, but 16 bits (UTF-16) on Windows. The class wstring, defined in <string>, is a sequence of wchar_t, just as the string class is a sequence of char.
UTF-8 is an encoding system for Unicode. It maps every Unicode character to a unique byte sequence, and maps that byte sequence back to the same character. "UTF" stands for "Unicode Transformation Format."
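To make the character-to-bytes direction concrete, here is a hedged sketch of an encoder in C. `utf8_encode` is a hypothetical name chosen for this example, not a function from any library mentioned in this thread. It follows the standard UTF-8 bit layout and rejects surrogates and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
   Writes 1 to 4 bytes into out and returns how many were written,
   or 0 if cp is a surrogate (U+D800..U+DFFF) or above U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* outside the Unicode code space */
}
```

For instance, U+00E9 (é) comes out as the two bytes C3 A9 and U+20AC (€) as E2 82 AC. The leading byte tells a decoder how many continuation bytes follow, which is what makes the mapping reversible.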
A nice, light library that I use successfully is utf8proc.
There's also MicroUTF-8, but it may require login credentials to view or download the source.
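If even a small library is more than the splitting use case needs, the core of it can be hand-rolled. The following is only a sketch under stated assumptions, not how utf8proc or MicroUTF-8 do it: `utf8_decode`, `is_delim`, `utf8_span` and `utf8_split` are hypothetical names, and the delimiter set is an arbitrary illustrative sample rather than a complete list of Unicode separators.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from s (n bytes available).
   Returns bytes consumed, or 0 on malformed input (bad lead byte,
   truncated or overlong sequence, surrogate, or value above U+10FFFF). */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;

    if (n < len) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* Illustrative delimiter set only; extend it for real text. */
static int is_delim(uint32_t cp)
{
    switch (cp) {
    case 0x0009: case 0x000A: case 0x000D: case 0x0020: /* ASCII whitespace */
    case 0x00A0: /* NO-BREAK SPACE */
    case 0x2028: /* LINE SEPARATOR */
    case 0x3000: /* IDEOGRAPHIC SPACE */
    case 0x3001: /* IDEOGRAPHIC COMMA */
    case 0x3002: /* IDEOGRAPHIC FULL STOP */
        return 1;
    default:
        return 0;
    }
}

typedef struct { size_t start; size_t len; } utf8_span; /* byte offsets */

/* Split s (n bytes) into tokens separated by delimiter code points.
   Stores up to max_out spans in out and returns the number of tokens
   found, or -1 if the input is not valid UTF-8. */
static int utf8_split(const char *s, size_t n, utf8_span *out, size_t max_out)
{
    size_t i = 0, tok = 0;
    int in_tok = 0, count = 0;

    while (i < n) {
        uint32_t cp;
        size_t len = utf8_decode((const unsigned char *)s + i, n - i, &cp);
        if (len == 0) return -1;
        if (is_delim(cp)) {
            if (in_tok) {
                if ((size_t)count < max_out) { out[count].start = tok; out[count].len = i - tok; }
                count++;
                in_tok = 0;
            }
        } else if (!in_tok) {
            tok = i;
            in_tok = 1;
        }
        i += len;
    }
    if (in_tok) {
        if ((size_t)count < max_out) { out[count].start = tok; out[count].len = i - tok; }
        count++;
    }
    return count;
}
```

Each span is a byte offset and length into the original buffer, so tokens can be handed to a stemmer without copying. A library such as utf8proc can replace the hard-coded `is_delim` with real Unicode category lookups if the delimiter set needs to be complete.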