I have a problem reading and using the content from unicode files.
I am working on a unicode release build, and I am trying to read the content from an unicode file, but the data has strange characters and I can't seem to find a way to convert the data to ASCII.
I'm using fgets
. I tried fgetws
, WideCharToMultiByte
, and a lot of functions which I found in other articles and posts, but nothing worked.
Because you mention WideCharToMultiByte I will assume you are dealing with Windows.
"read the content from an unicode file ... find a way to convert data to ASCII"
This might be a problem. If you convert Unicode to ASCII (or other legacy code page) you will run into the risk of corrupting/losing data. Since you are "working on a unicode release build" you will want to read Unicode and stay Unicode.
So your final buffer will have to be wchar_t
(or WCHAR
, or CStringW
, same thing).
So your file might be utf-16, or utf-8 (utf-32 is quite rare). For utf-16 the endianess might also matter. If there is a BOM that will help a lot.
Quick steps:
wopen
, or _wfopen
as binarywchar_t
with WideCharToMultiByte
and CP_UTF8
wchar_t
array and _swab
wchar_t
array and you are doneAlso (if you use a newer Visual Studio), you might take advantage of an MS extension to _wfopen
. It can take an encoding as part of the mode (something like _wfopen(L"newfile.txt", L"rw, ccs=<encoding>");
with the encoding being UTF-8 or UTF-16LE). It can also detect the encoding based on the BOM.
Warning: to be cross-platform is problematic, wchar_t
can be 2 or 4 bytes, the conversion routines are not portable...
Useful links:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With