I have a function which requires me to pass a UTF-8 string pointed to by a char*, and I have a char pointer to a single-byte string. How can I convert the string to UTF-8 encoding in C++? Is there any code I can use to do this? Thanks!
In Java, a String is converted to UTF-8 with the getBytes() method, which encodes the String into a sequence of bytes and returns a byte array. In the getBytes(String charsetName) overload, charsetName is the name of the charset used to encode the String into the array of bytes, for example "UTF-8".
UTF-8 encoding in Notepad (Windows): Click File in the top-left corner of your screen and choose Save As. In the dialog which appears, select the following options: in the "Save as type" drop-down, select All Files; in the "Encoding" drop-down, select UTF-8.
Click Tools, then select Web options. Go to the Encoding tab. In the dropdown for Save this document as: choose Unicode (UTF-8). Click Ok.
Assuming Linux, you're looking for iconv. When you open the converter (iconv_open), you pass the from and to encodings. If you pass an empty string as from, it'll convert from the locale used on your system, which should match the file system.
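As a rough sketch of the iconv approach described above: the helper name to_utf8 and its signature are made up for illustration, and the code assumes a glibc-style iconv (on some platforms you need to link with -liconv, and the input pointer type differs slightly). It does not handle the final flush that stateful encodings need.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from `from_encoding` to UTF-8.
// Passing "" as from_encoding asks iconv to use the current locale's encoding.
std::string to_utf8(const std::string& input, const char* from_encoding) {
    // Note the argument order: target encoding first, source encoding second.
    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed: unsupported conversion");
    }

    std::string output;
    char buffer[256];
    // iconv takes a non-const char** even though it does not modify the input.
    char* in_ptr = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = buffer;
        std::size_t out_left = sizeof(buffer);
        std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        int err = errno;  // save before any other call can overwrite it
        output.append(buffer, sizeof(buffer) - out_left);
        // E2BIG only means the output buffer filled up, so loop again.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }

    iconv_close(cd);
    return output;
}
```

Calling to_utf8(name, "ISO-8859-1") would convert a Latin-1 file name; which source encoding to pass depends entirely on where the string came from.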
On Windows, you have pretty much the same with MultiByteToWideChar, where you pass CP_ACP as the codepage. But on Windows you can simply call the Unicode version of the functions to get Unicode straight away and then convert to UTF-8 with WideCharToMultiByte and CP_UTF8.
To convert a string to a different character encoding, use any of various character encoding libraries. A popular choice is iconv (the standard on most Linux systems).
However, to do this you first need to figure out the encoding of your input. There is unfortunately no general solution to this. If the input does not specify its encoding (as web pages generally do), you'll have to guess.
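If the encoding does turn out to be known, the conversion itself can be simple. Here is a minimal sketch that assumes the input is ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point; that assumption is mine, not something stated in the question, and the function name latin1_to_utf8 is made up. For any other single-byte code page (CP1252, CP437, ...) this mapping is wrong and a library such as iconv is the safer route.

```cpp
#include <string>

// Assumption: `input` is ISO-8859-1, so each byte is a code point in U+0000..U+00FF.
std::string latin1_to_utf8(const std::string& input) {
    std::string output;
    output.reserve(input.size() * 2);  // worst case: every byte becomes two
    for (unsigned char c : input) {
        if (c < 0x80) {
            // ASCII range is identical in UTF-8.
            output.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF encode as two bytes: 110xxxxx 10xxxxxx.
            output.push_back(static_cast<char>(0xC0 | (c >> 6)));
            output.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return output;
}
```

The result's c_str() can then be passed to the function that expects a UTF-8 char*.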
As to your question: You write that you get the string from calling readdir on a FAT32 file system. I'm not quite sure, but I believe readdir will return the file names as they are stored by the file system. In the case of FAT/FAT32:
If you use the standard vfat Linux kernel module to access the FAT32 partition, you should get long file names from readdir (unless a file only has an 8.3 name). FAT32 stores the long file names in UTF-16 internally, but you will not see UTF-16 from readdir: the vfat driver converts the names to the encoding given by the iocharset= mount parameter (with the default being the default system encoding, I believe).
Additional information:
You may have to play with the mount options codepage and iocharset (see http://linux.die.net/man/8/mount ) to get filenames right on the FAT32 volume. Try to mount such that filenames are shown correctly in a Linux console, then proceed. There is some more explanation here: http://www.nslu2-linux.org/wiki/HowTo/MountFATFileSystems
I guess the top bit is set on some byte of your single-byte string, so the function you're passing it to expects that byte to be part of a multi-byte sequence.
First, print the string out in hex.
i.e.
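#include <cstdio>
#include <cstring>

const char* str = "your string";
for (size_t i = 0; i < strlen(str); i++)
    printf("[%02x]", (unsigned char)str[i]);  /* cast avoids sign extension for bytes >= 0x80 */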
Now have a read of the Wikipedia article on UTF-8 encoding, which explains it well:
http://en.wikipedia.org/wiki/UTF-8
UTF-8 is variable width where each character can occupy from 1 to 4 bytes.
Therefore, convert the hex to binary and look at the leading bits of each byte.
For example, if the first byte starts with 11110 (in binary), it announces a 4-byte sequence. Since ASCII is 7-bit (0-127), the top bit is always zero, so each ASCII character takes only 1 byte. By the way, the bytes following the first byte of a multi-byte character in a UTF-8 string start with "10" in their top bits. These are the continuation bytes, and that is probably what your function is complaining about: the continuation bytes are missing where they are expected. So the string is not quite pure ASCII as you thought it was.
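The bit patterns above can be checked in code. The following is only a sketch with made-up function names (utf8_sequence_length, looks_like_utf8): it tests the structure of lead and continuation bytes and nothing more, so it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Length of the UTF-8 sequence announced by its first byte.
// Returns 0 if the byte cannot start a sequence (e.g. a continuation byte).
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or 11111xxx
}

// Structural check: every lead byte is followed by the right number of
// continuation bytes (10xxxxxx).
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (len == 0 || i + len > s.size()) return false;
        for (std::size_t k = 1; k < len; ++k) {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}
```

A Latin-1 string such as "caf\xe9" fails this check because 0xE9 announces a 3-byte sequence that never arrives, which matches the kind of complaint described above.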
You can convert using iconv, as someone suggested, or perhaps with this library: http://utfcpp.sourceforge.net/