 

How to convert UTF-8 to ASCII in C++?

Tags:

c++

I am getting a response from a server in UTF-8 but I am not able to read it. How do I convert UTF-8 to ASCII in C++?

Suri asked Jun 05 '10

People also ask

Does C use ASCII or UTF-8?

Most C code that deals with strings on a byte-by-byte basis still works, since UTF-8 is fully compatible with 7-bit ASCII.

Is UTF-8 the same as ASCII?

For characters represented by the 7-bit ASCII character codes, the UTF-8 representation is exactly equivalent to ASCII, allowing transparent round-trip migration. Other Unicode characters are represented in UTF-8 by sequences of up to 4 bytes, though most Western European characters require only 2 bytes.

Is UTF-8 a subset of ASCII?

In modern times, ASCII is now a subset of UTF-8, not its own scheme. UTF-8 is backwards compatible with ASCII.


4 Answers

First note that ASCII is a 7-bit format. There are 8-bit encodings; if you are after one of these (such as ISO 8859-1), you'll need to be more specific.

To convert an ASCII string to UTF-8, do nothing: they are the same. So if your UTF-8 string is composed only of ASCII characters, then it is already an ASCII string, and no conversion is necessary.

If the UTF-8 string contains non-ASCII characters (anything with accents or non-Latin characters), there is no way to convert it to ASCII. (You may be able to convert it to one of the ISO encodings instead.)

There are ways to strip the accents from Latin characters to get at least some resemblance in ASCII. Alternatively, if you just want to delete the non-ASCII characters, simply delete all bytes with values >= 128 from the UTF-8 string.
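A minimal sketch of that byte-deletion approach (the function name strip_non_ascii is illustrative): in valid UTF-8, every byte of a multi-byte sequence has its most significant bit set, so erasing all bytes >= 128 removes exactly the non-ASCII characters.

    #include <algorithm>
    #include <iterator>
    #include <string>

    // Keep only 7-bit bytes; in valid UTF-8, the lead and continuation
    // bytes of non-ASCII characters all have values >= 128.
    std::string strip_non_ascii(const std::string& utf8)
    {
        std::string ascii;
        std::copy_if(utf8.begin(), utf8.end(), std::back_inserter(ascii),
                     [](unsigned char c) { return c < 128; });
        return ascii;
    }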

Artelius answered Oct 23 '22


This example works under Windows (you did not mention your target operating system):

    #include <windows.h>   // MultiByteToWideChar
    #include <stdlib.h>    // wcstombs_s

    // The sample buffer contains "©ha®a©te®s" in UTF-8
    unsigned char buffer[15] = { 0xc2, 0xa9, 0x68, 0x61, 0xc2, 0xae, 0x61, 0xc2, 0xa9, 0x74, 0x65, 0xc2, 0xae, 0x73, 0x00 };
    // utf8 is the pointer to your UTF-8 string
    char* utf8 = (char*)buffer;
    // first call computes the required UTF-16 length in wide characters,
    // including the terminating NUL (the input length of -1 covers it)
    int length = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, NULL, 0);
    if (length > 0)
    {
        wchar_t* wide = new wchar_t[length];
        // convert multibyte UTF-8 to wide string UTF-16
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, wide, length);

        // convert it to ANSI; use setlocale() to set your locale, if not set
        size_t convertedChars = 0;
        char* ansi = new char[length];
        wcstombs_s(&convertedChars, ansi, length, wide, _TRUNCATE);
    }

Remember to delete[] wide and ansi when they are no longer needed. Since this is Unicode, I'd recommend sticking to wchar_t* instead of char* unless you are certain that the input buffer contains only characters covered by the target ANSI code page.
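If you'd rather avoid the manual delete[] bookkeeping, here is a sketch of the same two-step conversion with std::wstring and std::string owning the buffers (still Windows-only; the function name utf8_to_ansi is made up, and WideCharToMultiByte with CP_ACP is used in place of wcstombs_s):

    #include <windows.h>
    #include <string>

    // UTF-8 -> UTF-16 -> ANSI (current code page); the string classes own
    // the buffers, so nothing has to be deleted by hand. Returns an empty
    // string on failure.
    std::string utf8_to_ansi(const char* utf8)
    {
        int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        if (wlen <= 0) return std::string();
        std::wstring wide(wlen, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &wide[0], wlen);

        int alen = WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                                       NULL, 0, NULL, NULL);
        if (alen <= 0) return std::string();
        std::string ansi(alen, '\0');
        WideCharToMultiByte(CP_ACP, 0, wide.c_str(), -1,
                            &ansi[0], alen, NULL, NULL);
        ansi.resize(alen - 1);   // drop the embedded terminating NUL
        return ansi;
    }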

Aoi Karasu answered Oct 23 '22


If the string contains characters which do not exist in ASCII, then there is nothing you can do, because, well, those characters do not exist in ASCII.

If the string contains only characters which do exist in ASCII, then there is nothing you need to do: the string is already in the ASCII encoding. UTF-8 was specifically designed to be backwards-compatible with ASCII, in such a way that any character which is in ASCII has the exact same encoding in UTF-8 as it has in ASCII, and any character which is not in ASCII can never have an encoding which is valid ASCII. Specifically, any non-ASCII character is encoded as a sequence of 2–4 octets, all of which have their most significant bit set, i.e. have an integer value > 127, and are therefore illegal in ASCII.

Instead of simply trying to convert the string, you could try to transliterate the string. Most languages on this planet have some form of ASCII transliteration scheme that at least keeps the text somewhat comprehensible. For example, my first name is "Jörg" and its ASCII transliteration would be "Joerg". The name of the creator of the Ruby Programming Language is "まつもとゆきひろ" and its ASCII transliteration would be "Matsumoto Yukihiro". However, please note that you will lose information. For example, the German sz-ligature gets transliterated to "ss", so the word "Maße" (measurements) gets transliterated to "Masse". However, "Masse" (mass, in the physicist's sense, not the Christian's) is also a word. As another example, Turkish has 4 "i"s (small and capital, with and without dot) and ASCII only has 2 (small with dot and capital without dot), therefore you will either lose information about the dot or whether or not it was a capital letter.
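As a toy illustration of such a transliteration, here is a sketch covering only the German umlauts and ß (the table and the name transliterate_german are invented for this example; a real system would use a library such as ICU):

    #include <map>
    #include <string>

    // Replace a few 2-byte UTF-8 sequences with ASCII equivalents;
    // anything not in the table is passed through unchanged.
    std::string transliterate_german(const std::string& utf8)
    {
        static const std::map<std::string, std::string> table = {
            { "\xc3\xa4", "ae" }, { "\xc3\xb6", "oe" }, { "\xc3\xbc", "ue" },
            { "\xc3\x84", "Ae" }, { "\xc3\x96", "Oe" }, { "\xc3\x9c", "Ue" },
            { "\xc3\x9f", "ss" },   // the sz-ligature mentioned above
        };
        std::string out;
        for (std::size_t i = 0; i < utf8.size(); ) {
            auto it = (i + 1 < utf8.size()) ? table.find(utf8.substr(i, 2))
                                            : table.end();
            if (it != table.end()) { out += it->second; i += 2; }
            else                   { out += utf8[i];    i += 1; }
        }
        return out;
    }

With this, the UTF-8 string "Jörg" comes out as "Joerg".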

So, the only way which will not lose information (in other words: corrupt data) is to somehow encode the non-ASCII characters into sequences of ASCII characters. There are many popular encoding schemes: SGML entity references, MIME, Unicode escape sequences, TeX or LaTeX. So, you would encode the data as it enters your system and decode it when it leaves the system.
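To make the escape-sequence idea concrete, here is a sketch that decodes UTF-8 and writes every non-ASCII code point as a Java-style \uXXXX escape (\UXXXXXXXX above U+FFFF). It assumes the input is valid UTF-8; the function name is illustrative.

    #include <cstdio>
    #include <string>

    // Leave ASCII bytes as-is; encode every other code point as an escape.
    std::string escape_non_ascii(const std::string& utf8)
    {
        std::string out;
        for (std::size_t i = 0; i < utf8.size(); ) {
            unsigned char b = utf8[i];
            unsigned cp; int len;
            if      (b < 0x80) { cp = b;        len = 1; }   // plain ASCII
            else if (b < 0xE0) { cp = b & 0x1F; len = 2; }   // 2-byte lead
            else if (b < 0xF0) { cp = b & 0x0F; len = 3; }   // 3-byte lead
            else               { cp = b & 0x07; len = 4; }   // 4-byte lead
            for (int k = 1; k < len; ++k)                    // continuation bytes
                cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else {
                char buf[12];
                if (cp > 0xFFFF)
                    std::snprintf(buf, sizeof buf, "\\U%08X", cp);
                else
                    std::snprintf(buf, sizeof buf, "\\u%04X", cp);
                out += buf;
            }
            i += len;
        }
        return out;
    }

For example, "Jörg" becomes "J\u00F6rg"; decoding on the way out of the system reverses the mapping without any loss.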

Of course, the easiest way would be to simply fix your system.

Jörg W Mittag answered Oct 23 '22


UTF-8 is an encoding that can map every Unicode character. ASCII only supports a very small subset of Unicode.

For the subset of unicode that is ASCII, the mapping from UTF-8 to ASCII is a direct one-to-one byte mapping, so if the server sends you a document that only contains ASCII characters in UTF-8 encoding then you can directly read that as ASCII.
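Checking for that case is straightforward, since "only ASCII characters" means exactly "no byte >= 128" (the name is_ascii is illustrative):

    #include <algorithm>
    #include <string>

    // True when the UTF-8 string contains only 7-bit bytes and can
    // therefore be read directly as ASCII.
    bool is_ascii(const std::string& utf8)
    {
        return std::all_of(utf8.begin(), utf8.end(),
                           [](unsigned char c) { return c < 128; });
    }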

If the response contains non-ASCII characters then, whatever you do, you won't be able to express them in ASCII. To filter these out of a UTF-8 stream you can just filter out any byte >= 128 (0x80 hex).

CB Bailey answered Oct 23 '22