 

How to convert UTF-8 to ASCII in C++?

Tags:

c++

I am getting a response from a server in UTF-8 but I am not able to read it. How do I convert UTF-8 to ASCII in C++?

Suri asked Jun 05 '10

People also ask

Does C use ASCII or UTF-8?

Most C code that deals with strings on a byte-by-byte basis still works, since UTF-8 is fully compatible with 7-bit ASCII.

Is UTF-8 the same as ASCII?

For characters represented by the 7-bit ASCII character codes, the UTF-8 representation is exactly equivalent to ASCII, allowing transparent round-trip migration. Other Unicode characters are represented in UTF-8 by sequences of up to 4 bytes, though most Western European characters require only 2 bytes.

Is UTF-8 a subset of ASCII?

In modern times, ASCII is now a subset of UTF-8, not its own scheme. UTF-8 is backwards compatible with ASCII.


4 Answers

First note that ASCII is a 7-bit format. There are 8-bit encodings; if you are after one of these (such as ISO 8859-1), you'll need to be more specific.

To convert an ASCII string to UTF-8, do nothing: they are the same. So if your UTF-8 string is composed only of ASCII characters, then it is already an ASCII string, and no conversion is necessary.

If the UTF-8 string contains non-ASCII characters (anything with accents or non-Latin characters), there is no way to convert it to ASCII. (You may be able to convert it to one of the ISO encodings instead.)

There are ways to strip the accents from Latin characters to get at least some resemblance in ASCII. Alternatively, if you just want to delete the non-ASCII characters, simply delete all bytes with values >= 128 from the UTF-8 string.
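A minimal sketch of that byte-deletion approach (the function name strip_non_ascii is illustrative): in valid UTF-8, every byte of a multi-byte sequence has its most significant bit set, so erasing all bytes >= 128 removes exactly the non-ASCII characters.

    #include <algorithm>
    #include <iterator>
    #include <string>

    // Keep only 7-bit bytes; in valid UTF-8, the lead and continuation
    // bytes of non-ASCII characters all have values >= 128.
    std::string strip_non_ascii(const std::string& utf8)
    {
        std::string ascii;
        std::copy_if(utf8.begin(), utf8.end(), std::back_inserter(ascii),
                     [](unsigned char c) { return c < 128; });
        return ascii;
    }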

Artelius answered Oct 23 '22


This example works under Windows (you did not mention your target operating system):

    #include <windows.h>   // MultiByteToWideChar
    #include <stdlib.h>    // wcstombs_s

    // The sample buffer contains "©ha®a©te®s" in UTF-8
    unsigned char buffer[15] = { 0xc2, 0xa9, 0x68, 0x61, 0xc2, 0xae, 0x61, 0xc2, 0xa9, 0x74, 0x65, 0xc2, 0xae, 0x73, 0x00 };
    // utf8 is the pointer to your UTF-8 string
    char* utf8 = (char*)buffer;
    // first call computes the required UTF-16 length in wide characters,
    // including the terminating NUL (the input length of -1 covers it)
    int length = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, NULL, 0);
    if (length > 0)
    {
        wchar_t* wide = new wchar_t[length];
        // convert multibyte UTF-8 to wide string UTF-16
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, wide, length);

        // convert it to ANSI; use setlocale() to set your locale, if not set
        size_t convertedChars = 0;
        char* ansi = new char[length];
        wcstombs_s(&convertedChars, ansi, length, wide, _TRUNCATE);
    }

Remember to delete[] wide and ansi when they are no longer needed. Since this is Unicode, I'd recommend sticking to wchar_t* instead of char* unless you are certain that the input buffer contains only characters covered by the target ANSI code page.
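If you'd rather avoid the manual delete[] bookkeeping, here is a sketch of the same two-step conversion with std::wstring and std::string owning the buffers (still Windows-only; the function name utf8_to_ansi is made up, and WideCharToMultiByte with CP_ACP is used in place of wcstombs_s):

    #include <windows.h>
    #include <string>

    // UTF-8 -> UTF-16 -> ANSI (current code page); the string classes own
    // the buffers, so nothing has to be deleted by hand. Returns an empty
    // string on failure.
    std::string utf8_to_ansi(const char* utf8)
    {
        int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        if (wlen <= 0) return std::string();
        std::wstring wide(wlen, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &wide[0], wlen);

        int alen = WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                                       NULL, 0, NULL, NULL);
        if (alen <= 0) return std::string();
        std::string ansi(alen, '\0');
        WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                            &ansi[0], alen, NULL, NULL);
        ansi.resize(alen - 1);   // drop the embedded terminating NUL
        return ansi;
    }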

Aoi Karasu answered Oct 23 '22


If the string contains characters which do not exist in ASCII, then there is nothing you can do, because, well, those characters do not exist in ASCII.

If the string contains only characters which do exist in ASCII, then there is nothing you need to do: the string is already in the ASCII encoding. UTF-8 was specifically designed to be backwards-compatible with ASCII, in such a way that any character which is in ASCII has the exact same encoding in UTF-8 as it has in ASCII, and any character which is not in ASCII can never have an encoding which is valid ASCII. Specifically, any non-ASCII character is encoded as a sequence of 2–4 octets, all of which have their most significant bit set, i.e. have an integer value > 127, and are therefore illegal in ASCII.

Instead of simply trying to convert the string, you could try to transliterate the string. Most languages on this planet have some form of ASCII transliteration scheme that at least keeps the text somewhat comprehensible. For example, my first name is "Jörg" and its ASCII transliteration would be "Joerg". The name of the creator of the Ruby Programming Language is "まつもとゆきひろ" and its ASCII transliteration would be "Matsumoto Yukihiro". However, please note that you will lose information. For example, the German sz-ligature gets transliterated to "ss", so the word "Maße" (measurements) gets transliterated to "Masse". However, "Masse" (mass, in the physicist's sense, not the Christian's) is also a word. As another example, Turkish has 4 "i"s (small and capital, with and without dot) and ASCII only has 2 (small with dot and capital without dot), therefore you will either lose information about the dot or whether or not it was a capital letter.
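As a toy illustration of such a transliteration, here is a sketch covering only the German umlauts and ß (the table and the name transliterate_german are invented for this example; a real system would use a library such as ICU):

    #include <map>
    #include <string>

    // Replace a few 2-byte UTF-8 sequences with ASCII equivalents;
    // anything not in the table is passed through unchanged.
    std::string transliterate_german(const std::string& utf8)
    {
        static const std::map<std::string, std::string> table = {
            { "\xc3\xa4", "ae" }, { "\xc3\xb6", "oe" }, { "\xc3\xbc", "ue" },
            { "\xc3\x84", "Ae" }, { "\xc3\x96", "Oe" }, { "\xc3\x9c", "Ue" },
            { "\xc3\x9f", "ss" },   // the sz-ligature mentioned above
        };
        std::string out;
        for (std::size_t i = 0; i < utf8.size(); ) {
            auto it = (i + 1 < utf8.size()) ? table.find(utf8.substr(i, 2))
                                            : table.end();
            if (it != table.end()) { out += it->second; i += 2; }
            else                   { out += utf8[i];    i += 1; }
        }
        return out;
    }

With this, the UTF-8 string "Jörg" comes out as "Joerg".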

So, the only way which will not lose information (in other words: corrupt data) is to somehow encode the non-ASCII characters into sequences of ASCII characters. There are many popular encoding schemes: SGML entity references, MIME, Unicode escape sequences, TeX or LaTeX. So, you would encode the data as it enters your system and decode it when it leaves the system.
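To make the escape-sequence idea concrete, here is a sketch that decodes UTF-8 and writes every non-ASCII code point as a Java-style \uXXXX escape (\UXXXXXXXX above U+FFFF). It assumes the input is valid UTF-8; the function name is illustrative.

    #include <cstdio>
    #include <string>

    // Leave ASCII bytes as-is; encode every other code point as an escape.
    std::string escape_non_ascii(const std::string& utf8)
    {
        std::string out;
        for (std::size_t i = 0; i < utf8.size(); ) {
            unsigned char b = utf8[i];
            unsigned cp; int len;
            if      (b < 0x80) { cp = b;        len = 1; }   // plain ASCII
            else if (b < 0xE0) { cp = b & 0x1F; len = 2; }   // 2-byte lead
            else if (b < 0xF0) { cp = b & 0x0F; len = 3; }   // 3-byte lead
            else               { cp = b & 0x07; len = 4; }   // 4-byte lead
            for (int k = 1; k < len; ++k)                    // continuation bytes
                cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else {
                char buf[12];
                if (cp > 0xFFFF)
                    std::snprintf(buf, sizeof buf, "\\U%08X", cp);
                else
                    std::snprintf(buf, sizeof buf, "\\u%04X", cp);
                out += buf;
            }
            i += len;
        }
        return out;
    }

For example, "Jörg" becomes "J\u00F6rg"; decoding on the way out of the system reverses the mapping without any loss.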

Of course, the easiest way would be to simply fix your system.

Jörg W Mittag answered Oct 23 '22


UTF-8 is an encoding that can map every Unicode character. ASCII only supports a very small subset of Unicode.

For the subset of unicode that is ASCII, the mapping from UTF-8 to ASCII is a direct one-to-one byte mapping, so if the server sends you a document that only contains ASCII characters in UTF-8 encoding then you can directly read that as ASCII.
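Checking for that case is straightforward, since "only ASCII characters" means exactly "no byte >= 128" (the name is_ascii is illustrative):

    #include <algorithm>
    #include <string>

    // True when the UTF-8 string contains only 7-bit bytes and can
    // therefore be read directly as ASCII.
    bool is_ascii(const std::string& utf8)
    {
        return std::all_of(utf8.begin(), utf8.end(),
                           [](unsigned char c) { return c < 128; });
    }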

If the response contains non-ASCII characters then, whatever you do, you won't be able to express them in ASCII. To filter these out of a UTF-8 stream you can just filter out any byte >= 128 (0x80 hex).

CB Bailey answered Oct 23 '22