
Convert UTF-8 characters to nearest equivalent ASCII characters using C++ (without WinAPI)

Tags: c++, ascii, utf-8

Does anybody have a code snippet that could convert at least the most common characters of the European languages? For example:

testáén

as a UTF-8 encoded string (i.e. bytes in hex: 74 65 73 74 c3 a1 c3 a9 6e 0)

to

testaen

(I'd like to use C/C++ and the standard library, or small cross-platform libraries.)

Graphyt asked Dec 15 '22 23:12


2 Answers

Here's code that converts characters from the upper half of ISO-8859-1 (U+00A0 to U+00FF) to similar ASCII characters. ASCII characters pass through unchanged, and a replacement character is used for everything else.

#include <array>
#include <codecvt>
#include <iostream>
#include <locale>   // std::wstring_convert and std::codecvt are declared here
#include <string>

constexpr char const *rc = "?"; // replacement_char

// table mapping ISO-8859-1 characters to similar ASCII characters
std::array<char const *,96> conversions = {{
   " ",  "!","c","L", rc,"Y", "|","S", rc,"C","a","<<",   rc,  "-",  "R", "-",
    rc,"+/-","2","3","'","u", "P",".",",","1","o",">>","1/4","1/2","3/4", "?", 
   "A",  "A","A","A","A","A","AE","C","E","E","E", "E",  "I",  "I",  "I", "I",
   "D",  "N","O","O","O","O", "O","*","0","U","U", "U",  "U",  "Y",  "P","ss",
   "a",  "a","a","a","a","a","ae","c","e","e","e", "e",  "i",  "i",  "i", "i",
   "d",  "n","o","o","o","o", "o","/","0","u","u", "u",  "u",  "y",  "p", "y"    
}};

template <class Facet>
class usable_facet : public Facet {
public:
    using Facet::Facet;
    ~usable_facet() {}
};

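// Decodes UTF-8 to UTF-32, then maps each code point to ASCII.
// from_bytes throws std::range_error on malformed UTF-8.
// Note: std::wstring_convert is deprecated since C++17.
std::string to_ascii(std::string const &utf8) {
    std::wstring_convert<usable_facet<std::codecvt<char32_t,char,std::mbstate_t>>,
                         char32_t> convert;
    std::u32string utf32 = convert.from_bytes(utf8);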

    std::string ascii;
    for (char32_t c : utf32) {
        if (c<=U'\u007F')
            ascii.push_back(static_cast<char>(c));
        else if (U'\u00A0'<=c && c<=U'\u00FF')
            ascii.append(conversions[c - U'\u00A0']);
        else
            ascii.append(rc);
    }
    return ascii;
}

int main() {
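    // UTF-8 bytes of "testáén" written as escapes: a u8"" literal is char8_t
    // in C++20 and no longer converts to std::string.
    std::cout << to_ascii("test\xc3\xa1\xc3\xa9n\n"); // prints "testaen"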
}
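
Since std::wstring_convert has been deprecated since C++17, the decoding step could also be done by hand. The following is only a minimal sketch, not part of the original answer: decode_utf8 is a hypothetical helper name, and it assumes that replacing each malformed byte with U+FFFD is acceptable. It does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-32 without std::wstring_convert.
// Each malformed or truncated byte becomes U+FFFD. Overlong encodings and
// surrogates are NOT rejected; this is a sketch, not a validating decoder.
std::u32string decode_utf8(std::string const &in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(U'\uFFFD'); ++i; continue; }  // invalid lead byte

        bool ok = i + extra < in.size();  // enough bytes left?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char cb = static_cast<unsigned char>(in[i + k]);
            if ((cb & 0xC0) != 0x80) ok = false;     // not a continuation byte
            else cp = (cp << 6) | (cb & 0x3F);
        }
        if (!ok) { out.push_back(U'\uFFFD'); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

In to_ascii, the result of decode_utf8(utf8) could take the place of convert.from_bytes(utf8); the U+FFFD it produces for bad input falls through to the replacement character.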
bames53 answered Dec 21 '22 23:12


Unicode contains a gigantic collection of characters that you would need to handle, so 'small' is an impossible criterion. The ICU library contains what you need, but for that reason it is not small. You will need, for example, to deal with both precomposed characters (such as é as a single code point) and decomposed sequences (a base letter followed by a combining mark).
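
To illustrate the decomposed case only: if the input is already decoded to UTF-32, dropping everything in the Combining Diacritical Marks block (U+0300 to U+036F) leaves the base letters. This is a rough sketch with a hypothetical function name, strip_combining_marks. It does nothing for precomposed characters, which still need a table or a real normalization step such as the one ICU provides.

```cpp
#include <string>

// Hypothetical helper: remove code points in the Combining Diacritical Marks
// block (U+0300..U+036F), so decomposed "e" + U+0301 becomes plain "e".
// Precomposed characters such as U+00E9 are left untouched.
std::u32string strip_combining_marks(std::u32string const &in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c < U'\u0300' || c > U'\u036F')
            out.push_back(c);
    }
    return out;
}
```

For example, the decomposed string U"teste\u0301n" comes back as U"testen", while a precomposed U"\u00E9" passes through unchanged.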

If you really only care about a small subset of the possible Unicode characters, then you can create your own simple mapping table.
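
A minimal sketch of such a table, assuming the text is already available as UTF-32: fold_to_ascii is a hypothetical name, and the handful of entries below are illustrative picks, not a complete list for any language.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: map a small, hand-picked set of code points to ASCII.
// The table entries are illustrative assumptions; extend them as needed.
std::string fold_to_ascii(std::u32string const &in) {
    static const std::unordered_map<char32_t, char const *> table = {
        {U'\u00E1', "a"},   // a with acute
        {U'\u00E9', "e"},   // e with acute
        {U'\u00DF', "ss"},  // sharp s
        {U'\u0142', "l"},   // l with stroke
        {U'\u010D', "c"},   // c with caron
        {U'\u0161', "s"},   // s with caron
        {U'\u0151', "o"},   // o with double acute
    };
    std::string out;
    for (char32_t c : in) {
        if (c <= 0x7F) {                       // ASCII passes through
            out.push_back(static_cast<char>(c));
            continue;
        }
        auto it = table.find(c);
        out.append(it != table.end() ? it->second : "?");  // "?" if unmapped
    }
    return out;
}
```

With this table, fold_to_ascii(U"test\u00E1\u00E9n") gives "testaen", and any code point without an entry becomes "?".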

bmargulies answered Dec 21 '22 23:12