Convert UTF-8 characters to nearest equivalent ASCII characters using c++ (without winapi)

Question

Does anybody have a code snippet that could convert at least the most common characters for the European languages? For example:

testáén

as a UTF-8 encoded string (i.e. bytes in hex: 74 65 73 74 c3 a1 c3 a9 6e 0)

to

testaen

(I'd like to use C/C++ and the standard library, or small cross-platform libraries.)

bames53 · Accepted Answer

Here's code that converts characters in the printable ISO-8859-1 range (U+00A0 to U+00FF) to similar ASCII characters. Everything else outside ASCII is replaced with a replacement character ("?").

#include <array>
#include <codecvt>
#include <iostream>
#include <locale>   // std::wstring_convert
#include <string>

constexpr char const *rc = "?"; // replacement_char

// table mapping ISO-8859-1 characters to similar ASCII characters
std::array<char const *,96> conversions = {{
   " ",  "!","c","L", rc,"Y", "|","S", rc,"C","a","<<",   rc,  "-",  "R", "-",
    rc,"+/-","2","3","'","u", "P",".",",","1","o",">>","1/4","1/2","3/4", "?", 
   "A",  "A","A","A","A","A","AE","C","E","E","E", "E",  "I",  "I",  "I", "I",
   "D",  "N","O","O","O","O", "O","*","0","U","U", "U",  "U",  "Y",  "P","ss",
   "a",  "a","a","a","a","a","ae","c","e","e","e", "e",  "i",  "i",  "i", "i",
   "d",  "n","o","o","o","o", "o","/","0","u","u", "u",  "u",  "y",  "p", "y"    
}};

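// std::codecvt has a protected destructor, so std::wstring_convert cannot
// delete it directly; this wrapper exposes a public one.
template <class Facet>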
class usable_facet : public Facet {
public:
    using Facet::Facet;
    ~usable_facet() {}
};

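std::string to_ascii(std::string const &utf8) {
    // Decode UTF-8 to UTF-32 so each element is one code point.
    // Note: std::wstring_convert is deprecated since C++17, and from_bytes
    // throws std::range_error if the input is not valid UTF-8.
    std::wstring_convert<usable_facet<std::codecvt<char32_t,char,std::mbstate_t>>,
                         char32_t> convert;
    std::u32string utf32 = convert.from_bytes(utf8);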

    std::string ascii;
    for (char32_t c : utf32) {
        if (c<=U'\u007F')
            ascii.push_back(static_cast<char>(c));
        else if (U'\u00A0'<=c && c<=U'\u00FF')
            ascii.append(conversions[c - U'\u00A0']);
        else
            ascii.append(rc);
    }
    return ascii;
}

int main() {
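    // UTF-8 bytes of "testáén", spelled out so the result does not depend on
    // the source file encoding or on u8 literals being char (char8_t in C++20).
    std::cout << to_ascii("test\xC3\xA1\xC3\xA9n\n");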
}
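
Since std::wstring_convert is deprecated as of C++17, the decoding step above could also be done by hand. The following is only a sketch of one way to do that, not part of the original answer: decode_utf8 is a hypothetical helper name, and replacing malformed input with U+FFFD (rather than throwing) is an assumption about the desired behavior.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-32 code points.
// Assumption: each malformed byte is replaced by U+FFFD instead of throwing.
std::u32string decode_utf8(std::string const &in) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }

        // len == 0 means a stray continuation byte or an invalid lead byte.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char t = static_cast<unsigned char>(in[i + k]);
            if ((t & 0xC0) != 0x80) ok = false;       // not a continuation byte
            else cp = (cp << 6) | (t & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (ok && (cp < min_cp[len] || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(U'\uFFFD'); ++i; }   // resync one byte later
    }
    return out;
}
```

With this in place, the line that calls convert.from_bytes(utf8) could be swapped for decode_utf8(utf8); U+FFFD falls outside the table range, so the existing loop would map it to the replacement character.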

bmargulies · Answer

There is a gigantic collection of Unicode characters that you'd need to handle, so "small" is an impossible criterion. The ICU library contains what you need, but for this reason you won't find it small. You'll need, for example, to deal with both precomposed characters (á as the single code point U+00E1) and decomposed ones (a followed by the combining acute accent U+0301).
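
To illustrate the composed versus non-composed point: a table keyed on single code points never sees the decomposed form, because the accent arrives as a separate code point after the base letter. A rough sketch of handling the simplest case is below. It assumes the text has already been decoded to UTF-32, and it only covers the Combining Diacritical Marks block (U+0300 to U+036F); other combining ranges exist and are ignored here. strip_combining_marks is a hypothetical name, and this is not a substitute for ICU's normalization.

```cpp
#include <string>

// Hypothetical helper: removes code points from the Combining Diacritical
// Marks block (U+0300..U+036F), leaving the base letters in place.
// Assumption: no other combining ranges need handling.
std::u32string strip_combining_marks(std::u32string const &in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        if (c >= 0x0300 && c <= 0x036F)
            continue;  // combining mark: the base letter was already copied
        out.push_back(c);
    }
    return out;
}
```

Precomposed characters such as U+00E1 pass through unchanged, so they would still need a table lookup afterwards.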

If you really only care about a small subset of the possible Unicode characters, then you can create your own simple mapping table.
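
As a sketch of what such a table might look like for characters beyond ISO-8859-1, the snippet below maps a few Latin Extended-A letters to ASCII. The choice of characters and the name ascii_fallback are illustrative assumptions, not a complete or authoritative list.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical lookup: returns an ASCII stand-in for a code point, or "?"
// when the code point is not in the (deliberately small) table.
char const *ascii_fallback(char32_t c) {
    static const std::unordered_map<char32_t, char const *> table = {
        {U'\u0141', "L"},  {U'\u0142', "l"},   // L/l with stroke (Polish)
        {U'\u010C', "C"},  {U'\u010D', "c"},   // C/c with caron
        {U'\u0160', "S"},  {U'\u0161', "s"},   // S/s with caron
        {U'\u017D', "Z"},  {U'\u017E', "z"},   // Z/z with caron
        {U'\u0150', "O"},  {U'\u0151', "o"},   // O/o with double acute
        {U'\u0152', "OE"}, {U'\u0153', "oe"},  // OE/oe ligature
    };
    auto it = table.find(c);
    return it != table.end() ? it->second : "?";
}
```

A hash map keeps the table sparse, which suits a hand-picked subset; a contiguous range is better served by a plain array indexed by offset.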
