I need a generic transliteration or substitution regex that maps extended Latin characters to similar-looking ASCII characters, and all other extended characters to '' (the empty string), so that:
é becomes e
ê becomes e
á becomes a
ç becomes c
Ď becomes D
and so on, but things like ‡ or Ω or ‰ just get stripped away.
Entering extended characters is easy on Windows: hold down the ALT key and type the character's decimal code on the numeric keypad; the corresponding character is entered when you release ALT. For example, Alt-132 gives you a lowercase "a" with an umlaut (ä).
Inserting ASCII characters: to insert an ASCII character, press and hold down ALT while typing the character code. For example, to insert the degree (°) symbol, press and hold down ALT while typing 0176 on the numeric keypad. You must type the numbers on the numeric keypad, not on the number row of the main keyboard.
Extended ASCII represents both control characters and printable characters. Control characters are used to perform actions rather than to display a printable character on screen. Easily understood examples include 'Escape', 'Backspace' and 'Delete'.
Extended characters are those which are not in the standard ASCII character set, which uses 7-bit characters and thus has values 0 to 127. ASCII Codes 0 to 31 and 127 are non-printing control characters, while codes 32 to 126 match the keys on a US keyboard ("a", "A", etc.).
Use Unicode::Normalize and take NFD($str). In this form every character with a diacritic is decomposed into a base character followed by one or more combining diacritic characters. Then simply remove all the non-ASCII characters, which strips the combining marks along with symbols such as ‡, Ω and ‰.
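The two steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `to_ascii` is a hypothetical helper name, and the sketch assumes the input is already a decoded character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;
use utf8;                          # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFD);

# to_ascii is a hypothetical helper name, not part of Unicode::Normalize.
sub to_ascii {
    my ($str) = @_;
    my $decomposed = NFD($str);          # e.g. "é" becomes "e" followed by U+0301
    $decomposed =~ s/[^\x00-\x7F]//g;    # drop everything outside 7-bit ASCII
    return $decomposed;
}

print to_ascii("éêáçĎ‡Ω‰"), "\n";   # prints "eeacD"
```

Note that letters with no canonical decomposition, such as ø, ß or æ, are removed by this approach rather than mapped to a look-alike; if you need those, you would have to add an explicit mapping before the final substitution.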