
How to read UTF-8 string given its length in characters in plain C89?

Tags:

c

unicode

c89

I'm writing a custom cross-platform minimalistic TCP server in plain C89. (But I will also accept a POSIX-specific answer.)

The server works with UTF-8 strings, but never looks inside them. It treats all strings as immutable binary blobs.

But now I need to accept UTF-8 strings from a client that does not know how to calculate their size in bytes. The client can only transmit the string length in characters. (Update: the client is in JavaScript, and "length in characters" is, in fact, whatever String.length returns. I assume that means actual UTF-8 characters, not something else.)

I do not want to add heavy dependencies to my tiny server. Is there a robust and neat way to read this datagram? (For the sake of this question, let's say that it is read from FILE *.)

U<CRLF>       ; data type marker (actually read by dispatching code)
<SIZE><CRLF>  ; UTF-8 string size in characters
<DATA><CRLF>  ; data blob

Example:

U
7
Юникод!

Update:

One batch of data can contain more than one datagram, so approximate reads would not work; I need to read the exact number of characters.

And the actual UTF-8 data may contain any characters, so I can't pick a character as a terminator, and I don't want to mess with escaping it in the data.
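
To make the layout concrete, here is roughly the shape of my reading code up to the point where I get stuck (the names are illustrative, not my actual code). The 'U' marker line has already been consumed by the dispatcher; what follows reads the size line:

#include <stdio.h>
#include <stdlib.h>

/* Reads the <SIZE> line that follows the 'U' marker and returns the
   declared length in characters, or -1 on read failure. */
static long read_size_line(FILE *fp)
{
    char line[32];
    if (fgets(line, sizeof line, fp) == NULL) return -1;
    return strtol(line, NULL, 10); /* stops at the trailing CR/LF */
}

After that I still need to read exactly that many characters of <DATA>, which is the part I'm asking about.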

Alexander Gladysh asked Apr 01 '11 18:04

2 Answers

It's pretty easy to write a UTF-8 "reader" given the information here; UTF-8 was designed so tasks like this one would be easy.

In essence, you start reading characters until you read as many as the client tells you. You know that you've read a whole character from the UTF-8 encoding definition, specifically:

If the character is encoded by just one byte, the high-order bit is 0 and the other bits give the code value (in the range 0..127). If the character is encoded by a sequence of more than one byte, the first byte has as many leading '1' bits as the total number of bytes in the sequence, followed by a '0' bit, and the succeeding bytes are all marked by a leading "10" bit pattern.
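
In C89 that check is just a test on the top bits of each lead byte. A minimal sketch (the function name and buffer contract are illustrative, not a standard API):

#include <stdio.h>

/* Illustrative sketch: read exactly n_chars UTF-8 characters from fp into
   buf, which must have room for n_chars * 4 + 1 bytes. Returns the number
   of bytes stored, or -1 on EOF or malformed input. */
static long read_utf8_chars(FILE *fp, char *buf, long n_chars)
{
    long bytes = 0;
    long i;
    for (i = 0; i < n_chars; ++i) {
        int c = getc(fp);
        int extra; /* continuation bytes still expected */
        if (c == EOF) return -1;
        if ((c & 0x80) == 0x00)      extra = 0; /* 0xxxxxxx: 1-byte character */
        else if ((c & 0xE0) == 0xC0) extra = 1; /* 110xxxxx: 2-byte character */
        else if ((c & 0xF0) == 0xE0) extra = 2; /* 1110xxxx: 3-byte character */
        else if ((c & 0xF8) == 0xF0) extra = 3; /* 11110xxx: 4-byte character */
        else return -1;                         /* stray continuation byte   */
        buf[bytes++] = (char)c;
        while (extra-- > 0) {
            c = getc(fp);
            if (c == EOF || (c & 0xC0) != 0x80) return -1; /* not 10xxxxxx */
            buf[bytes++] = (char)c;
        }
    }
    buf[bytes] = '\0';
    return bytes;
}

Note that this only validates the byte-level structure; it does not reject overlong encodings or surrogate code points, which a stricter server might also want to check.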

Jon answered Nov 07 '22 09:11


Well, the length property of JavaScript strings seems to count codepoints, not characters, as you can see (but wait! it's not quite codepoints):

> s1='\u0061\u0301'
'á'
> s2='\u00E1'
'á'
> s1.length
2
> s2.length
1
>

Although that's with V8. Looking around, it seems that's actually what the ECMAScript standard requires:

https://forums.teradata.com/blog/jasonstrimpel/2011/11/javascript-string-length-and-internationalizing-web-applications

Also, checking ECMA-262, on pages 40-41 of the PDF it says "The length of a String is the number of elements (i.e., 16-bit values) within it", and then goes on to make clear that the elements are UTF-16 units. Sadly that's not quite "codepoints". Basically, this makes the string length property rather useless. Looking around I find this:

How can I tell if a string contains multibyte characters in Javascript?
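
If the client really sends String.length, the server therefore has to count UTF-16 code units rather than UTF-8 characters: anything that fits in 1-3 UTF-8 bytes is one unit, and a 4-byte sequence (a character outside the BMP, stored as a surrogate pair in JavaScript) is two. A hedged C sketch under that assumption (the function name and buffer contract are made up for illustration):

#include <stdio.h>

/* Illustrative sketch: read from fp until exactly n_units UTF-16 code units
   (what JavaScript's String.length counts) have been consumed, storing the
   raw UTF-8 bytes in buf (at least n_units * 4 + 1 bytes). Returns bytes
   stored, or -1 on EOF, malformed input, or a length that splits a
   surrogate pair. */
static long read_utf16_units(FILE *fp, char *buf, long n_units)
{
    long bytes = 0;
    long units = 0;
    while (units < n_units) {
        int c = getc(fp);
        int extra;
        if (c == EOF) return -1;
        if ((c & 0x80) == 0x00)      extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3; /* outside the BMP */
        else return -1;
        units += (extra == 3) ? 2 : 1; /* surrogate pair counts as two */
        buf[bytes++] = (char)c;
        while (extra-- > 0) {
            c = getc(fp);
            if (c == EOF || (c & 0xC0) != 0x80) return -1;
            buf[bytes++] = (char)c;
        }
    }
    if (units != n_units) return -1; /* declared length ended mid-pair */
    buf[bytes] = '\0';
    return bytes;
}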

Nico answered Nov 07 '22 11:11