
Reading a UTF-8 Unicode file through non-unicode code

I have to read a text file which is Unicode with UTF-8 encoding and have to write this data to another text file. The file has tab-separated data in lines.

My reading code is C++ without Unicode support. It reads the file line by line into a string/char* and writes that string as-is to the destination file. I can't change the code, so code-change suggestions are not welcome.

What I want to know is this: while reading line by line, can I encounter a NULL terminating character ('\0') within a line, given that the file is Unicode and one character can span multiple bytes?

My thinking was that it is quite possible that a NULL terminating character could be encountered within a line. Your thoughts?

Aamir asked Dec 10 '22 20:12



2 Answers

UTF-8 uses 1 byte for all ASCII characters, which have the same code values as in standard ASCII, and 2 to 4 bytes for all other characters. The upper bits of each byte are reserved as control bits: in every byte of a multi-byte sequence the most significant bit is set, so none of those bytes can be 0.

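Thus a 0 byte will not appear in your UTF-8 file unless the text itself contains the NUL character (U+0000), which a tab-separated text file normally does not.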
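To make the bit layout concrete, here is a minimal C++ sketch (not a change to your reading code, just an illustration). The helpers `encode_utf8`, `contains_zero_byte` and `no_zero_bytes_above_ascii` are hypothetical names made up for this example. The encoder follows the standard UTF-8 bit patterns and does no validation, for instance it does not reject surrogates.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point (U+0000..U+10FFFF) as UTF-8.
// Illustration only; no validation of surrogates or out-of-range values.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: true if any byte of s is 0x00.
bool contains_zero_byte(const std::string& s) {
    return s.find('\0') != std::string::npos;
}

// Brute-force check of the claim: no code point above ASCII
// produces a zero byte when encoded.
bool no_zero_bytes_above_ascii() {
    for (std::uint32_t cp = 0x80; cp <= 0x10FFFF; ++cp) {
        if (contains_zero_byte(encode_utf8(cp))) return false;
    }
    return true;
}
```

For example, `encode_utf8(0x20AC)` (the euro sign) yields the three bytes E2 82 AC, none of which is zero. The only input for which `contains_zero_byte` returns true is U+0000 itself.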

See the Wikipedia article on UTF-8 for details.

CsTamas answered Dec 28 '22 01:12



Very unlikely: every byte in a UTF-8 multi-byte sequence has its high bit set to 1, so a zero byte can only come from an actual NUL character (U+0000) in the data.

Maurice Perry answered Dec 28 '22 00:12
