
C++ Text file won't save in Unicode, it keeps saving in ANSI

So basically, I need to be able to create a text file in Unicode, but whatever I do it keeps saving in ANSI.

Here's my code:

    wchar_t name[] = L"‎中國哲學書電子化計劃";
    FILE * pFile;
    pFile = fopen("chineseLetters.txt", "w");

    fwrite(name, sizeof(wchar_t), sizeof(name), pFile);
    fclose(pFile);

And here is the output of my "chineseLetters.txt":

     -NWòTx[øfû–P[SŠƒR  õ2123

Also, the application is in MBCS and cannot be changed into Unicode, because it needs to work with both Unicode and ANSI.

I'd really appreciate some help here. Thanks.

Thanks for all the quick replies! It works!

Simply adding the BOM to the string, as in L"\uFFFE中國哲學書電子化計劃", still didn't work: the text editor still recognized the file as CP1252. So I used two fwrite calls instead of one, one for the BOM and one for the characters. Here's my code now:

    wchar_t name[] = L"‎中國哲學書電子化計劃";
    unsigned char bom[] = { 0xFF, 0xFE };
    FILE * pFile;
    pFile = fopen("chineseLetters.txt", "wb");  // binary mode: text mode on Windows would expand any 0x0A byte to 0x0D 0x0A
    fwrite(bom, sizeof(unsigned char), sizeof(bom), pFile);
    fwrite(name, sizeof(wchar_t), wcslen(name), pFile);
    fclose(pFile);
Kelv asked Jan 20 '15 21:01





1 Answer

I need to be able to create a text file in Unicode

Unicode is not an encoding; do you mean UTF-16LE? That is the encoding, built from two-byte code units, that Windows x86/x64 uses for internal string storage in memory, and some Windows applications like Notepad misleadingly describe UTF-16LE as “Unicode” in their UI.

    fwrite(name, sizeof(wchar_t), sizeof(name), pFile);

You've copied the memory storage of the string directly to a file. If you compile this under Windows/MSVCRT then because the internal storage encoding is UTF-16LE, the file you have produced is encoded as UTF-16LE. If you compile this in other environments you will get different results.

And here is the output of my "chineseLetters.txt": -NWòTx[øfû–P[SŠƒR õ2123

That's what the UTF-16LE-encoded data would look like if you misinterpreted the file as Windows Code Page 1252 (Western European).

If you have loaded the file into a Windows application such as Notepad, it probably doesn't know that the file contains UTF-16LE-encoded data, and so defaults to reading the file using your default locale-specific (ANSI, mbcs) code page as the encoding, resulting in the above mojibake.
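To make the mojibake concrete, here is a minimal sketch that serializes UTF-16 code units as little-endian bytes by hand. The helper name `utf16le_bytes` is made up for illustration, and it uses `char16_t` rather than `wchar_t` so the result does not depend on the platform's `wchar_t` width.

```cpp
#include <string>

// Hypothetical helper: serialize UTF-16 code units as little-endian bytes,
// independent of the host byte order and of the width of wchar_t.
std::string utf16le_bytes(const std::u16string& text) {
    std::string bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte first
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    }
    return bytes;
}
```

For example, 中 is U+4E2D, so `utf16le_bytes(u"\u4E2D")` yields the bytes 0x2D 0x4E, which a CP1252 (or ASCII) reader shows as `-N`, the first two visible characters of the garbled output.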

When you are making a UTF-16 file you should put a Byte Order Mark character U+FEFF at the start of it to let the consumer know whether it's UTF-16LE or UTF-16BE. This also gives apps like Notepad a hint that the file contains UTF-16 at all, and not ANSI. So you would probably find that writing L"\uFEFF‎中國哲學書電子化計劃" would make the output file display better in Notepad.
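If you do want a UTF-16LE file, something along these lines should work. This is only a sketch: the function name `write_utf16le_with_bom` is invented here, and it takes a `std::u16string` instead of `wchar_t` so that it behaves the same where `wchar_t` is four bytes. It opens the file in binary mode because text mode on Windows turns every 0x0A byte into 0x0D 0x0A, which would corrupt a character such as 上 (U+4E0A).

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: write `text` as UTF-16LE, preceded by a byte order mark.
// Returns true if every write and the final close succeeded.
bool write_utf16le_with_bom(const char* path, const std::u16string& text) {
    std::FILE* file = std::fopen(path, "wb");  // binary: no newline translation
    if (!file) return false;

    const unsigned char bom[] = {0xFF, 0xFE};  // U+FEFF in little-endian byte order
    bool ok = std::fwrite(bom, 1, sizeof bom, file) == sizeof bom;

    for (char16_t unit : text) {
        const unsigned char pair[] = {
            static_cast<unsigned char>(unit & 0xFF),  // low byte
            static_cast<unsigned char>(unit >> 8)};   // high byte
        ok = ok && std::fwrite(pair, 1, sizeof pair, file) == sizeof pair;
    }
    return std::fclose(file) == 0 && ok;
}
```

Note also that the original call passed `sizeof(name)` as the element count. That is the array's size in bytes, not in characters, so `fwrite` reads past the end of the array; that is most likely where the stray characters at the end of your output come from. A character count such as `wcslen(name)` is the right value there.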

But it's probably better to convert the wchar_ts into char bytes in an explicitly stated encoding (e.g. UTF-8), rather than relying on whatever in-memory storage format the C library happens to use. On Win32 you can do this using the WideCharToMultiByte API, or by opening the file with a ccs flag in the fopen mode string (for example "w, ccs=UTF-8" in the Microsoft C runtime), as described by Mr.C64. If you choose to write a UTF-16LE file with ccs it will put the BOM in for you.
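As a rough illustration of "convert to a stated encoding", here is a hand-written UTF-16 to UTF-8 encoder. It is a portable stand-in, not the WideCharToMultiByte API itself, and the name `utf16_to_utf8` is my own. Replacing unpaired surrogates with U+FFFD is an assumption about what you would want for malformed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode UTF-16 text as UTF-8.
// Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch).
std::string utf16_to_utf8(const std::u16string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t cp = text[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: needs a low one next
            if (i + 1 < text.size() && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
                ++i;  // consumed the low surrogate as well
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {  // stray low surrogate
            cp = 0xFFFD;
        }

        if (cp < 0x80) {  // 1 byte
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {  // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {  // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {  // 4 bytes
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The resulting `std::string` can be written with a single `fwrite` to a file opened in binary mode. On Windows you would more likely call WideCharToMultiByte with CP_UTF8 and skip the hand-rolled loop.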

bobince answered Sep 30 '22 00:09
