How to convert a single-byte const char* to a UTF-8 encoding

I have a function which requires me to pass a UTF-8 string pointed to by a char*, and I have a char pointer to a single-byte string. How can I convert the string to UTF-8 encoding in C++? Is there any code I can use to do this? Thanks!

Asked by Luca Carlon, Dec 17 '10


3 Answers

Assuming Linux, you're looking for iconv. When you open the converter (iconv_open), you pass the source and destination encodings. If you pass an empty string as the source, it will convert from the locale used on your system, which should match the file system.
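A minimal sketch of that flow, assuming glibc's iconv; error handling is trimmed, and the helper name and buffer size are just for illustration:

#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

std::string to_utf8(const std::string& input) {
    // "" as the source encoding means "whatever the current locale uses"
    iconv_t cd = iconv_open("UTF-8", "");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string output;
    char* in_ptr = const_cast<char*>(input.data()); // iconv's API is not const-correct
    size_t in_left = input.size();

    while (in_left > 0) {
        char buf[256];
        char* out_ptr = buf;
        size_t out_left = sizeof(buf);
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        if (rc == (size_t)-1 && errno != E2BIG) {   // E2BIG just means "output buffer full"
            iconv_close(cd);
            throw std::runtime_error("iconv failed");
        }
        output.append(buf, sizeof(buf) - out_left); // keep whatever was converted
    }
    iconv_close(cd);
    return output;
}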

On Windows, you have pretty much the same with MultiByteToWideChar, where you pass CP_ACP as the code page. But on Windows you can simply call the Unicode versions of the API functions to get UTF-16 straight away, and then convert to UTF-8 with WideCharToMultiByte and CP_UTF8.
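A rough sketch of that two-hop route (ANSI code page to UTF-16, then UTF-16 to UTF-8); the helper name is illustrative only:

#include <windows.h>
#include <string>

std::string ansi_to_utf8(const char* ansi) {
    // First hop: ANSI code page (CP_ACP) to UTF-16.
    // Passing -1 as the length includes the terminating NUL in the count.
    int wlen = MultiByteToWideChar(CP_ACP, 0, ansi, -1, nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi, -1, &wide[0], wlen);

    // Second hop: UTF-16 to UTF-8.
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                                   nullptr, 0, nullptr, nullptr);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                        &utf8[0], ulen, nullptr, nullptr);

    utf8.resize(ulen - 1);  // drop the NUL that was counted by the -1 length
    return utf8;
}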

Answered by kichik, Oct 07 '22


To convert a string to a different character encoding, use any of various character encoding libraries. A popular choice is iconv (the standard on most Linux systems).

However, to do this you first need to figure out the encoding of your input. There is unfortunately no general solution to this. If the input does not specify its encoding (as web pages, for example, generally do), you'll have to guess.
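If the guess turns out to be ISO-8859-1 (Latin-1), the conversion is simple enough to do by hand, since each Latin-1 byte maps one-to-one onto the Unicode code point of the same value (U+0000 through U+00FF), and each of those needs at most two bytes in UTF-8. A minimal sketch, with an illustrative function name:

#include <string>

// Assumes the input really is Latin-1; every byte is its own code point.
std::string latin1_to_utf8(const char* in) {
    std::string out;
    for (; *in; ++in) {
        unsigned char c = static_cast<unsigned char>(*in);
        if (c < 0x80) {
            out += static_cast<char>(c);                 // ASCII: unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // lead byte 110xxxxx
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation 10xxxxxx
        }
    }
    return out;
}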

As to your question: You write that you get the string from calling readdir on a FAT32 file system. I'm not quite sure, but I believe readdir will return the file names as they are stored by the file system. In the case of FAT/FAT32:

  • The short file names are encoded in some DOS code page. Which code page depends on how the files were written; there's no way to tell from just the file system, AFAIK.
  • The long file names are in UTF-16.

If you use the standard vfat Linux kernel module to access the FAT32 partition, you should get the long file names from readdir (unless a file only has an 8.3 name). FAT32 stores the long file names internally as UTF-16; the vfat driver converts them to the encoding given by the iocharset= mount parameter (with the default being the system's default encoding, I believe).
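A minimal listing sketch, assuming the partition is mounted at a hypothetical /mnt/usb; the names printed arrive in whatever encoding iocharset= selected:

#include <dirent.h>
#include <cstdio>

int main() {
    DIR* dir = opendir("/mnt/usb");   // hypothetical mount point
    if (!dir) { perror("opendir"); return 1; }
    for (dirent* e; (e = readdir(dir)) != nullptr; )
        printf("%s\n", e->d_name);    // name as delivered by the vfat driver
    closedir(dir);
    return 0;
}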

Additional information:

You may have to play with the mount options codepage and iocharset (see http://linux.die.net/man/8/mount ) to get filenames right on the FAT32 volume. Try to mount such that filenames are shown correctly in a Linux console, then proceed. There is some more explanation here: http://www.nslu2-linux.org/wiki/HowTo/MountFATFileSystems
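For example (device, mount point, and option values here are hypothetical; pick the codepage that matches how the files were written):

mount -t vfat -o codepage=437,iocharset=utf8 /dev/sdb1 /mnt/usb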

Answered by sleske, Oct 07 '22


I guess the top bit is set on some byte of the single-byte string, so the function you're passing it to expects more than one byte to follow.

First, print the string out in hex, e.g.:

#include <stdio.h>
#include <string.h>

const char* str = "your string";  // string literals are const char* in C++
for (size_t i = 0; i < strlen(str); i++)
    printf("[%02x]", (unsigned char)str[i]);  // cast so %02x sees 0-255

Now have a read of the Wikipedia article on UTF-8 encoding, which explains it well:
http://en.wikipedia.org/wiki/UTF-8

UTF-8 is variable width where each character can occupy from 1 to 4 bytes.

Therefore, convert the hex to binary and see what the code point is.

i.e. if the first byte starts with 11110 (in binary), then it introduces a 4-byte sequence. Since ASCII is 7-bit (0-127), the top bit of an ASCII byte is always zero, so an ASCII character occupies only one byte. Note also that the bytes following the lead byte of a multi-byte UTF-8 character all start with "10" in their top bits. These are the continuation bytes, and that is what your function is complaining about: continuation bytes are missing where they were expected. So the string is not quite the pure ASCII you thought it was.
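To see those bit patterns at work, here is a small sketch (hypothetical helper name) that labels each byte of a candidate UTF-8 string the same way:

#include <cstdio>
#include <cstring>

void classify_utf8_bytes(const char* s) {
    for (size_t i = 0; i < strlen(s); i++) {
        unsigned char c = (unsigned char)s[i];
        if      (c < 0x80)           printf("%02x: ASCII, single byte\n", c);
        else if ((c & 0xC0) == 0x80) printf("%02x: continuation byte (10xxxxxx)\n", c);
        else if ((c & 0xE0) == 0xC0) printf("%02x: lead byte, 2-byte sequence (110xxxxx)\n", c);
        else if ((c & 0xF0) == 0xE0) printf("%02x: lead byte, 3-byte sequence (1110xxxx)\n", c);
        else if ((c & 0xF8) == 0xF0) printf("%02x: lead byte, 4-byte sequence (11110xxx)\n", c);
        else                         printf("%02x: not valid in UTF-8\n", c);
    }
}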

You can convert using iconv, as someone suggested, or perhaps this library: http://utfcpp.sourceforge.net/
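As one concrete use of the latter, utfcpp ships a header-only validity check; a sketch (the wrapper name is just for illustration):

#include <string>
#include "utf8.h"  // from http://utfcpp.sourceforge.net/

// Returns true if s is well-formed UTF-8, which is handy for confirming
// the diagnosis above before attempting any conversion.
bool looks_like_utf8(const std::string& s) {
    return utf8::is_valid(s.begin(), s.end());
}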

Answered by hookenz, Oct 07 '22