Reading a UTF-8 Unicode file through non-unicode code

Question

I have to read a text file which is Unicode with UTF-8 encoding and have to write this data to another text file. The file has tab-separated data in lines.

My reading code is C++ without Unicode support. It reads the file line by line into a string/char* and writes that string as-is to the destination file. I can't change the code, so code-change suggestions are not welcome.

What I want to know is: while reading line by line, can I encounter a NUL terminating character ('\0') within a line, given that the file is Unicode and one character can span multiple bytes?

My thinking was that it is quite possible for a NUL terminating character to appear within a line. Your thoughts?

CsTamas · Accepted Answer

UTF-8 uses 1 byte for every ASCII character, with the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters. The upper bits of each byte are reserved as control bits. For code points that need more than 1 byte, the control bits are set in every byte of the sequence, so none of those bytes can be zero.

Thus there will be no 0 byte in your UTF-8 file, unless the file actually contains the NUL character (U+0000) itself.

See the Wikipedia article on UTF-8 for details.
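To make the bit layout concrete, here is a minimal sketch of a UTF-8 encoder. The names `encode_utf8` and `contains_zero_byte` are hypothetical helpers written for illustration, not part of the asker's code or of any library, and the encoder assumes a valid code point up to U+10FFFF without checking for surrogates.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp <= 0x10FFFF; surrogates are not rejected in this sketch.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: true if the byte string holds a 0x00 byte anywhere.
bool contains_zero_byte(const std::string& bytes) {
    return bytes.find('\0') != std::string::npos;
}
```

Every lead byte is at least 0xC0 and every continuation byte is at least 0x80, so a multi-byte sequence never holds a zero byte, even when the low payload bits are all zero (U+0100 encodes as C4 80). Only code point 0 itself produces a 0x00 byte.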

Maurice Perry · Answer

Very unlikely: all the bytes in a UTF-8 multi-byte sequence have the high bit set to 1, so a zero byte can only appear if the file contains the NUL character (U+0000) itself.

Tags:

c++

text-files

unicode

utf-8

Aamir
