How to read UTF-8 encoded text file using std::ifstream?

I'm having a hard time parsing an XML file.

The file was saved with UTF-8 encoding.

Plain ASCII characters are read correctly, but Korean characters are not.

So I wrote a simple program that reads a UTF-8 text file and prints its content.

Text file (test.txt)

ABC가나다

Test Program

#include <fstream>
#include <iostream>
#include <string>
#include <iterator>
#include <streambuf>

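// Format one byte as two uppercase hex digits.
// Note: returns a pointer to a static buffer that is overwritten on every call.
const char* hex(char c) {
    const char REF[] = "0123456789ABCDEF";
    static char output[3] = "XX";
    output[0] = REF[0x0f & (c >> 4)];  // high nibble
    output[1] = REF[0x0f & c];         // low nibble
    return output;
}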

int main() {
    std::cout << "File(ifstream) : ";
    std::ifstream file("test.txt");
    std::string buffer((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
    for (auto c : buffer) {
        std::cout << hex(c)<< " ";
    }
    std::cout << std::endl;
    std::cout << buffer << std::endl;

    //String literal
    std::string str = "ABC가나다";
    std::cout << "String literal : ";
    for (auto c : str) {
        std::cout << hex(c) << " ";
    }
    std::cout << std::endl;
    std::cout << str << std::endl;

    return 0;
}

Output

File(ifstream) : 41 42 43 EA B0 80 EB 82 98 EB 8B A4
ABC媛?섎떎
String literal : 41 42 43 B0 A1 B3 AA B4 D9
ABC가나다

The output shows that the characters are encoded differently in the string literal and in the file.

As far as I know, char strings in C++ are encoded in UTF-8, which is why we can print them with printf or cout. So I expected the bytes to be the same, but they are actually different.

Is there any way to read a UTF-8 text file using std::ifstream?


I succeeded in parsing the XML file using std::wifstream, following this article.

But most of the libraries I'm using only support const char* strings, so I'm looking for another way that uses std::ifstream.

I've also read this article, which says not to use wchar_t and that treating char strings as multi-byte characters is sufficient.

asked Apr 08 '17 15:04 by JaeJun LEE


1 Answer

Encoding "ABC가나다" using UTF-8 should give you

"\x41\x42\x43\xEA\xB0\x80\xEB\x82\x98\xEB\x8B\xA4"

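so the content of the file you got is correct. The problem is with your source file encoding. A plain string literal containing non-ASCII characters is encoded in the compiler's execution character set, which is not necessarily UTF-8 (the bytes B0 A1 B3 AA B4 D9 in your output are the EUC-KR/CP949 encoding of 가나다). Prefix the literal with u8 to get a UTF-8 literal: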

u8"ABC가나다"

At this point I assume you are using Windows, otherwise you most likely wouldn't have run into this encoding issue. You will have to change your terminal's character set to UTF-8:

chcp 65001

What is happening in your case is that you are reading UTF-8 text from a file into a string, then printing it to a non-Unicode terminal, which cannot display it as you expect. When you print your string literal, you are printing a non-Unicode byte sequence, but that sequence's encoding matches your terminal's encoding, so you see what you expected.

PS: I used https://mothereff.in/utf-8 to get the UTF-8 representation of your string in hex.

answered Oct 03 '22 16:10 by StaceyGirl