How to convert a single-byte const char* to a UTF-8 encoding

I have a function which requires me to pass a UTF-8 string pointed to by a char*, and I have a char pointer to a single-byte string. How can I convert the string to UTF-8 encoding in C++? Is there any code I can use to do this? Thanks!

Asked by Luca Carlon, Dec 17 '10


3 Answers

Assuming Linux, you're looking for iconv. When you open the converter (iconv_open), you pass the source and destination encodings. If you pass an empty string as the source, it will convert from the locale used on your system, which should match the file system.
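A minimal sketch of that flow, assuming glibc's iconv; error handling is trimmed, and the helper name and buffer size are just for illustration:

#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

std::string to_utf8(const std::string& input) {
    // "" as the source encoding means "whatever the current locale uses"
    iconv_t cd = iconv_open("UTF-8", "");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string output;
    char* in_ptr = const_cast<char*>(input.data()); // iconv's API is not const-correct
    size_t in_left = input.size();

    while (in_left > 0) {
        char buf[256];
        char* out_ptr = buf;
        size_t out_left = sizeof(buf);
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        if (rc == (size_t)-1 && errno != E2BIG) {   // E2BIG just means "output buffer full"
            iconv_close(cd);
            throw std::runtime_error("iconv failed");
        }
        output.append(buf, sizeof(buf) - out_left); // keep whatever was converted
    }
    iconv_close(cd);
    return output;
}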

On Windows, you have pretty much the same with MultiByteToWideChar, where you pass CP_ACP as the code page. But on Windows you can simply call the Unicode versions of the API functions to get UTF-16 straight away, and then convert to UTF-8 with WideCharToMultiByte and CP_UTF8.
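A rough sketch of that two-hop route (ANSI code page to UTF-16, then UTF-16 to UTF-8); the helper name is illustrative only:

#include <windows.h>
#include <string>

std::string ansi_to_utf8(const char* ansi) {
    // First hop: ANSI code page (CP_ACP) to UTF-16.
    // Passing -1 as the length includes the terminating NUL in the count.
    int wlen = MultiByteToWideChar(CP_ACP, 0, ansi, -1, nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi, -1, &wide[0], wlen);

    // Second hop: UTF-16 to UTF-8.
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                                   nullptr, 0, nullptr, nullptr);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                        &utf8[0], ulen, nullptr, nullptr);

    utf8.resize(ulen - 1);  // drop the NUL that was counted by the -1 length
    return utf8;
}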

Answered by kichik, Oct 07 '22


To convert a string to a different character encoding, use any of various character encoding libraries. A popular choice is iconv (the standard on most Linux systems).

However, to do this you first need to figure out the encoding of your input. There is unfortunately no general solution to this. If the input does not specify its encoding (as web pages, for example, generally do), you'll have to guess.
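If the guess turns out to be ISO-8859-1 (Latin-1), the conversion is simple enough to do by hand, since each Latin-1 byte maps one-to-one onto the Unicode code point of the same value (U+0000 through U+00FF), and each of those needs at most two bytes in UTF-8. A minimal sketch, with an illustrative function name:

#include <string>

// Assumes the input really is Latin-1; every byte is its own code point.
std::string latin1_to_utf8(const char* in) {
    std::string out;
    for (; *in; ++in) {
        unsigned char c = static_cast<unsigned char>(*in);
        if (c < 0x80) {
            out += static_cast<char>(c);                 // ASCII: unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // lead byte 110xxxxx
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation 10xxxxxx
        }
    }
    return out;
}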

As to your question: You write that you get the string from calling readdir on a FAT32 file system. I'm not quite sure, but I believe readdir will return the file names as they are stored by the file system. In the case of FAT/FAT32:

  • The short file names are encoded in some DOS code page. Which code page depends on how the files were written; there's no way to tell from just the file system, AFAIK.
  • The long file names are in UTF-16.

If you use the standard vfat Linux kernel module to access the FAT32 partition, you should get the long file names from readdir (unless a file only has an 8.3 name). FAT32 stores the long file names internally as UTF-16; the vfat driver converts them to the encoding given by the iocharset= mount parameter (with the default being the system's default encoding, I believe).
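A minimal listing sketch, assuming the partition is mounted at a hypothetical /mnt/usb; the names printed arrive in whatever encoding iocharset= selected:

#include <dirent.h>
#include <cstdio>

int main() {
    DIR* dir = opendir("/mnt/usb");   // hypothetical mount point
    if (!dir) { perror("opendir"); return 1; }
    for (dirent* e; (e = readdir(dir)) != nullptr; )
        printf("%s\n", e->d_name);    // name as delivered by the vfat driver
    closedir(dir);
    return 0;
}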

Additional information:

You may have to play with the mount options codepage and iocharset (see http://linux.die.net/man/8/mount ) to get filenames right on the FAT32 volume. Try to mount such that filenames are shown correctly in a Linux console, then proceed. There is some more explanation here: http://www.nslu2-linux.org/wiki/HowTo/MountFATFileSystems
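For example (device, mount point, and option values here are hypothetical; pick the codepage that matches how the files were written):

mount -t vfat -o codepage=437,iocharset=utf8 /dev/sdb1 /mnt/usb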

Answered by sleske, Oct 07 '22


I guess the top bit is set on some byte of the single-byte string, so the function you're passing it to expects more than one byte to follow.

First, print the string out in hex, e.g.:

#include <stdio.h>
#include <string.h>

const char* str = "your string";  // string literals are const char* in C++
for (size_t i = 0; i < strlen(str); i++)
    printf("[%02x]", (unsigned char)str[i]);  // cast so %02x sees 0-255

Now have a read of the Wikipedia article on UTF-8 encoding, which explains it well:
http://en.wikipedia.org/wiki/UTF-8

UTF-8 is variable width where each character can occupy from 1 to 4 bytes.

Therefore, convert the hex to binary and see what the code point is.

i.e. if the first byte starts with 11110 (in binary), then it introduces a 4-byte sequence. Since ASCII is 7-bit (0-127), the top bit of an ASCII byte is always zero, so an ASCII character occupies only one byte. Note also that the bytes following the lead byte of a multi-byte UTF-8 character all start with "10" in their top bits. These are the continuation bytes, and that is what your function is complaining about: continuation bytes are missing where they were expected. So the string is not quite the pure ASCII you thought it was.
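To see those bit patterns at work, here is a small sketch (hypothetical helper name) that labels each byte of a candidate UTF-8 string the same way:

#include <cstdio>
#include <cstring>

void classify_utf8_bytes(const char* s) {
    for (size_t i = 0; i < strlen(s); i++) {
        unsigned char c = (unsigned char)s[i];
        if      (c < 0x80)           printf("%02x: ASCII, single byte\n", c);
        else if ((c & 0xC0) == 0x80) printf("%02x: continuation byte (10xxxxxx)\n", c);
        else if ((c & 0xE0) == 0xC0) printf("%02x: lead byte, 2-byte sequence (110xxxxx)\n", c);
        else if ((c & 0xF0) == 0xE0) printf("%02x: lead byte, 3-byte sequence (1110xxxx)\n", c);
        else if ((c & 0xF8) == 0xF0) printf("%02x: lead byte, 4-byte sequence (11110xxx)\n", c);
        else                         printf("%02x: not valid in UTF-8\n", c);
    }
}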

You can convert using iconv, as someone suggested, or perhaps this library: http://utfcpp.sourceforge.net/
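As one concrete use of the latter, utfcpp ships a header-only validity check; a sketch (the wrapper name is just for illustration):

#include <string>
#include "utf8.h"  // from http://utfcpp.sourceforge.net/

// Returns true if s is well-formed UTF-8, which is handy for confirming
// the diagnosis above before attempting any conversion.
bool looks_like_utf8(const std::string& s) {
    return utf8::is_valid(s.begin(), s.end());
}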

Answered by hookenz, Oct 07 '22