Does anybody have a code snippet that could convert at least the most common characters of the European languages to plain ASCII? For example:
testáén
as a UTF-8 encoded string (i.e. bytes in hex: 74 65 73 74 c3 a1 c3 a9 6e 0)
to
testaen
(I'd like to use C/C++ and the standard library, or small cross-platform libraries.)
Here's code that converts characters in the ISO-8859-1 range to ASCII. A replacement character is used for everything else outside ASCII.
#include <array>
#include <codecvt>
#include <iostream>
#include <locale>   // std::wstring_convert
#include <string>

constexpr char const *rc = "?"; // replacement_char

// table mapping ISO-8859-1 characters (U+00A0..U+00FF) to similar ASCII characters
std::array<char const *,96> conversions = {{
    " ",  "!",   "c", "L", rc,  "Y", "|",  "S", rc,  "C", "a", "<<", rc,    "-",   "R",   "-",
    rc,   "+/-", "2", "3", "'", "u", "P",  ".", ",", "1", "o", ">>", "1/4", "1/2", "3/4", "?",
    "A",  "A",   "A", "A", "A", "A", "AE", "C", "E", "E", "E", "E",  "I",   "I",   "I",   "I",
    "D",  "N",   "O", "O", "O", "O", "O",  "*", "0", "U", "U", "U",  "U",   "Y",   "P",   "ss",
    "a",  "a",   "a", "a", "a", "a", "ae", "c", "e", "e", "e", "e",  "i",   "i",   "i",   "i",
    "d",  "n",   "o", "o", "o", "o", "o",  "/", "0", "u", "u", "u",  "u",   "y",   "p",   "y"
}};

// std::codecvt has a protected destructor; this wrapper makes it
// usable with std::wstring_convert
template <class Facet>
class usable_facet : public Facet {
public:
    using Facet::Facet;
    ~usable_facet() {}
};

// Note: std::wstring_convert is deprecated since C++17, and
// from_bytes throws std::range_error on invalid UTF-8 input.
std::string to_ascii(std::string const &utf8) {
    std::wstring_convert<usable_facet<std::codecvt<char32_t,char,std::mbstate_t>>,
                         char32_t> convert;
    std::u32string utf32 = convert.from_bytes(utf8);

    std::string ascii;
    for (char32_t c : utf32) {
        if (c <= U'\u007F')                           // already ASCII
            ascii.push_back(static_cast<char>(c));
        else if (U'\u00A0' <= c && c <= U'\u00FF')    // covered by the table
            ascii.append(conversions[c - U'\u00A0']);
        else                                          // everything else
            ascii.append(rc);
    }
    return ascii;
}

int main() {
    // u8 literals are char-based up to C++17; in C++20 they are char8_t
    // and no longer convert to std::string.
    std::cout << to_ascii(u8"testáén\n");
}
There is a gigantic collection of Unicode characters that you would need to handle, so 'small' is an impossible criterion. The ICU library contains what you need, but for that reason it is not small. You would also need, for example, to handle both precomposed characters and base letters followed by combining marks.
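To illustrate the composed versus non-composed point: 'é' can arrive either as the single code point U+00E9 (bytes c3 a9) or as a plain 'e' followed by the combining acute accent U+0301 (bytes 65 cc 81). The sketch below handles only the second form. It is not a substitute for ICU normalization: it assumes valid UTF-8 input and only looks at the Combining Diacritical Marks block (U+0300..U+036F). The name strip_combining_marks is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: removes code points U+0300..U+036F (Combining
// Diacritical Marks) from a UTF-8 string. In UTF-8 these are exactly the
// two-byte sequences CC 80..CC BF and CD 80..CD AF.
// Assumption: the input is valid UTF-8.
std::string strip_combining_marks(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (i + 1 < utf8.size()) {
            unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
            bool is_mark = (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
                           (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
            if (is_mark) {
                ++i;        // skip the second byte of the mark as well
                continue;
            }
        }
        out.push_back(utf8[i]);   // keep every other byte unchanged
    }
    return out;
}
```

Calling strip_combining_marks on the decomposed bytes 65 cc 81 leaves just "e", while the precomposed bytes c3 a9 pass through untouched and still need a table lookup such as the one in the answer above.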
If you really only care about a small subset of the possible Unicode characters, then you can create your own simple mapping table.
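A simple mapping table of this kind could look roughly like the following. This is a sketch under stated assumptions, not a complete solution: the entries in kFold are an arbitrary illustrative subset, the names kFold and fold_to_ascii are invented here, and only two-byte UTF-8 sequences (U+0080..U+07FF) are looked up. Anything else that is not ASCII becomes "?". It does not depend on the deprecated std::wstring_convert.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Illustrative subset only; extend it with the characters your data contains.
static const std::map<char32_t, const char*> kFold = {
    {0x00E1, "a"}, {0x00E9, "e"}, {0x00ED, "i"}, {0x00F3, "o"}, {0x00FA, "u"},
    {0x00F6, "o"}, {0x00FC, "u"}, {0x00DF, "ss"},
    {0x0151, "o"}, {0x0171, "u"},                 // o/u with double acute
    {0x010D, "c"}, {0x0161, "s"}, {0x017E, "z"},  // c/s/z with caron
    {0x0142, "l"},                                // l with stroke
};

// Hypothetical helper: maps each non-ASCII code point through kFold,
// emitting "?" for anything not in the table.
// Assumption: only two-byte sequences are decoded; longer sequences and
// malformed bytes are replaced by a single "?".
std::string fold_to_ascii(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {                // ASCII passes through unchanged
            out.push_back(static_cast<char>(b0));
            ++i;
            continue;
        }
        // Sequence length implied by the lead byte (1 for a stray byte).
        std::size_t len = (b0 >= 0xF0) ? 4 : (b0 >= 0xE0) ? 3 : (b0 >= 0xC0) ? 2 : 1;
        char32_t cp = 0;
        if (len == 2 && i + 1 < utf8.size()) {
            unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
            cp = static_cast<char32_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F));
        }
        auto it = kFold.find(cp);
        out.append(it != kFold.end() ? it->second : "?");
        i += len;                       // skip the whole sequence
    }
    return out;
}
```

With the example from the question, fold_to_ascii("test\xc3\xa1\xc3\xa9n") yields "testaen". A std::map is used for brevity; for a dense range such as ISO-8859-1 a plain array indexed by code point, as in the answer above, is just as reasonable.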