How to read a UTF-8 encoded text file using std::ifstream?

Text file (test.txt)

ABC가나다

Test Program

#include <fstream>
#include <iostream>
#include <string>
#include <iterator>
#include <streambuf>

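// Returns the two-digit uppercase hex representation of one byte.
// Note: the result points to a static buffer that is overwritten on every call.
const char* hex(char c) {
    const char REF[] = "0123456789ABCDEF";
    static char output[3] = "XX";
    output[0] = REF[0x0f & (c >> 4)];
    output[1] = REF[0x0f & c];
    return output;
}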

int main() {
    std::cout << "File(ifstream) : ";
    std::ifstream file("test.txt");
    std::string buffer((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
    for (auto c : buffer) {
        std::cout << hex(c) << " ";
    }
    std::cout << std::endl;
    std::cout << buffer << std::endl;

    //String literal
    std::string str = "ABC가나다";
    std::cout << "String literal : ";
    for (auto c : str) {
        std::cout << hex(c) << " ";
    }
    std::cout << std::endl;
    std::cout << str << std::endl;

    return 0;
}

Output

File(ifstream) : 41 42 43 EA B0 80 EB 82 98 EB 8B A4
ABC媛?섎떎
String literal : 41 42 43 B0 A1 B3 AA B4 D9
ABC가나다

The output shows that the characters are encoded differently in the string literal and in the file.

As far as I know, char strings in C++ are encoded in UTF-8, which is why we can print them with printf or cout. So I expected the bytes to be the same, but they are actually different.

Is there any way to read a UTF-8 text file using std::ifstream?

I succeeded in parsing an XML file using std::wifstream by following this article.

But most of the libraries I'm using only support const char* strings, so I'm looking for another way that uses std::ifstream.

I have also read this article, which says not to use wchar_t and that treating char strings as multi-byte characters is sufficient.


asked Apr 08 '17 15:04

JaeJun LEE

1 Answer

Encoding "ABC가나다" using UTF-8 should give you

"\x41\x42\x43\xEA\xB0\x80\xEB\x82\x98\xEB\x8B\xA4"

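so the content you read from the file is correct. The problem is with your source file encoding. A plain string literal that contains non-ASCII characters is stored in the compiler's execution character set, which is implementation-defined; the bytes B0 A1 B3 AA B4 D9 in your output look like a legacy Korean code page (EUC-KR/CP949) rather than UTF-8. Prefix the literal with u8 to get a UTF-8 literal: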

u8"ABC가나다"

At this point I assume you are using Windows, otherwise you probably wouldn't have any issues with encodings. You will have to change your terminal's character set to UTF-8:

chcp 65001

What is happening in your case is that you are reading UTF-8 text from a file into a string and then printing it to a non-Unicode terminal, which is unable to show it as you expect. When you print your string literal, you are printing a non-Unicode byte sequence, but this sequence's encoding matches your terminal's encoding, so you see what you expected.

PS: I used https://mothereff.in/utf-8 to get the UTF-8 representation of your string in hex.


answered Oct 03 '22 16:10

StaceyGirl


Tags: c++, string, encoding, utf-8, ifstream