The Unicode standard encompasses 143,859 characters as of version 13.0. It includes all of your favorite emoji, as well as characters used in almost every language on the planet.
Unicode is a universal character set. It aims to include all the characters needed for any writing system or language. The first code point positions in Unicode use 16 bits to represent the most commonly used characters in a number of languages; this range, the Basic Multilingual Plane, allows for 65,536 characters.
UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.
You only count the bytes whose top two bits are not set to 10 (i.e., everything less than 0x80 or greater than 0xBF). That's because all the bytes with the top two bits set to 10 are UTF-8 continuation bytes.
See here for a description of the encoding and how strlen can work on a UTF-8 string.
For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point; all others are continuation bytes.
Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:
utf8left(char *destbuff, char *srcbuff, size_t sz);
utf8mid(char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest(char *destbuff, char *srcbuff, size_t pos);
to get, respectively:
the first sz UTF-8 characters of a string;
sz UTF-8 characters of a string, starting at pos;
the remaining UTF-8 characters of a string, starting at pos.
This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
Try this for size:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
    size_t len = 0;
    for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
    return len;
}

// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s
char *utf8index(char *s, size_t pos)
{
    ++pos;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80) --pos;
        if (pos == 0) return s;
    }
    return NULL;
}

// converts codepoint indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
    char *p = utf8index(s, *start);
    *start = p ? p - s : -1;
    p = utf8index(s, *end);
    *end = p ? p - s : -1;
}

// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
    return strcat(dest, src);
}

// test program
int main(int argc, char **argv)
{
    // slurp all of stdin to p, with length len
    char *p = malloc(0);
    size_t len = 0;
    while (true) {
        p = realloc(p, len + 0x10000);
        ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
        if (cnt == -1) {
            perror("read");
            abort();
        } else if (cnt == 0) {
            break;
        } else {
            len += cnt;
        }
    }
    p[len] = '\0'; // the utf8 functions above expect a terminated string

    // do some demo operations
    printf("utf8len=%zu\n", utf8len(p));
    ssize_t start = 2, end = 3;
    utf8slice(p, &start, &end);
    printf("utf8slice[2:3]=%.*s\n", (int)(end - start), p + start);
    start = 3; end = 4;
    utf8slice(p, &start, &end);
    printf("utf8slice[3:4]=%.*s\n", (int)(end - start), p + start);
    return 0;
}
Sample run:
matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā
Note that your example has an off-by-one error: theString[2] == "好".
The easiest way is to use a library like ICU.
Depending on your notion of "character", this question can get more or less involved.
First off, you should transform your byte string into a string of Unicode codepoints. You can do this with iconv() or ICU, though if this is the only thing you do, iconv() is a lot easier, and it's part of POSIX.
Your string of Unicode codepoints could be something like a null-terminated uint32_t[], or, if you have C11, an array of char32_t. The size of that array (i.e. its number of elements, not its size in bytes) is the number of codepoints (plus the terminator), and that should give you a very good start.
However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than codepoints - for instance, an a with a circumflex accent can be expressed as two Unicode codepoints (a plus a combining circumflex), or as a single precomposed codepoint â - both are valid, and both are required by the Unicode standard to be treated equally. There is a process called "normalization" which turns your string into a definite version, but there are many graphemes which are not expressible as a single codepoint, and in general there is no way around a proper library that understands this and counts graphemes for you.
That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into unicode codepoints is a must, everything beyond that is at your discretion.
Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler iconv() first.