iconv separates accents from letter when using libiconv

I'm trying to make a function that would return a given string without its accents, but iconv's //TRANSLIT option only seems to separate the accent and the letter without removing the accent.

Here's my function:

<?php
function strRemoveAccents($str)
{
    return iconv(mb_detect_encoding($str), 'us-ascii//TRANSLIT', $str);
}

And here are my results:

  • test 1
    • Input: Athènes
    • Expected output: Athenes
    • Current output: Ath`enes
  • test 2
    • Input: Gdańsk
    • Expected output: Gdansk
    • Current output: Gda'nsk
  • test 3
    • Input: niño
    • Expected output: nino
    • Current output: ni~no

Some details:

  • mb_detect_encoding returns 'UTF-8' for all of my tests, and replacing the call with its return value does not change anything.
  • My locale is currently set to LC_COLLATE=C;LC_CTYPE=French_France.1252;LC_MONETARY=C;LC_NUMERIC=C;LC_TIME=C
  • I also tried changing the locale to en_US.UTF-8 (I checked: the locale was successfully updated), but the function still returned the same result.
  • Tested on a MacBook with the default locale set to c/fr_FR.UTF-8/c/c/c/c, the problem is still the same.
  • I could strip the leftover accent characters myself, but since I'll be using the method on whole sentences, I don't want to remove more apostrophes than needed.
  • Edit: when testing with this sandbox, I get the results I want.

I'm probably missing something, but I don't see what.

Edit: As mentioned by @jasonwubz in his answer, the problem is only present when using libiconv, not when using glibc. Is there a way to make it work with either implementation?
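For readers comparing setups: PHP reports which iconv implementation it was built against through the ICONV_IMPL and ICONV_VERSION constants. The sketch below assumes the iconv extension is loaded and uses a hypothetical helper name, describeIconv; the value is typically "glibc" or "libiconv", but other values are possible.

```php
<?php
// Hypothetical helper: reports the iconv implementation PHP was built against.
// Assumes the iconv extension; falls back to a message when it is missing.
function describeIconv(): string
{
    if (!defined('ICONV_IMPL')) {
        return 'iconv extension not loaded';
    }
    // ICONV_IMPL is typically "glibc" or "libiconv".
    return ICONV_IMPL . ' ' . ICONV_VERSION;
}

echo describeIconv(), "\n";
```

Running this on each machine is a quick way to confirm whether a difference in //TRANSLIT output lines up with a difference in implementation.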

Marc Brillault asked Aug 25 '19 18:08



1 Answer

The difficulty with diacritics is that they are handled differently depending on the script. In Arabic, for example, a diacritic is a character with its own Unicode code point, and when it is attached to a letter it remains a separate character from that letter. For example, this is the letter Meem "م" and this is the diacritic Dammah "ُ"; when the Dammah is attached to the Meem, the string still contains 2 characters. That is why you can post nearly empty posts on the SE network using these kinds of diacritics.

So removing these diacritics from a string is as simple as searching for them (there are about 8) and replacing each one with an empty string, while leaving the base letters untouched.

$withoutDiacritic = str_replace(['ٌ','ُ','ً','َ','ٍ'], "", $string);

Latin characters are a different problem: when a diacritic is joined to a letter, the result is usually a single character with its own Unicode code point. For example, joining a grave accent to the letter "e" gives a different Unicode character, "è". So you can't do what we do for Arabic diacritics, searching for the diacritics and removing them. Instead you must search for the character "è" and replace it with "e", and that is what node diacritics does.
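To make the difference concrete, here is a minimal sketch that counts code points in both cases. The helper name countCodePoints is hypothetical, and it relies only on PCRE's /u mode, so no extra extension is assumed.

```php
<?php
// Hypothetical helper: counts Unicode code points via PCRE's UTF-8 mode.
function countCodePoints(string $s): int
{
    // With /u, "." matches one code point; /s lets it match newlines too.
    return (int) preg_match_all('/./us', $s);
}

$precomposed = "\u{00E8}";          // "è" stored as a single code point
$meemDammah  = "\u{0645}\u{064F}";  // Meem followed by the Dammah diacritic

echo countCodePoints($precomposed), "\n"; // 1
echo countCodePoints($meemDammah), "\n";  // 2

// Removing the Dammah leaves the Meem on its own.
var_dump(str_replace("\u{064F}", "", $meemDammah) === "\u{0645}"); // bool(true)
```

In the Arabic case there is a separate code point to search for and delete; in the precomposed Latin case there is only one code point, so it has to be mapped to its base letter.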

I made a PHP version of node diacritics; don't forget to star that project, as its authors did all the heavy lifting.

<?php

namespace PHPDiacritics;

class PHPDiacritics
{

    protected $replacementList = [
        ["base" => " ", "chars" => '"\u00A0"'],
        ["base" => "0", "chars" => '"\u07C0"'],
        ["base" => "A", "chars" => '"\u24B6\uFF21\u00C0\u00C1\u00C2\u1EA6\u1EA4\u1EAA\u1EA8\u00C3\u0100\u0102\u1EB0\u1EAE\u1EB4\u1EB2\u0226\u01E0\u00C4\u01DE\u1EA2\u00C5\u01FA\u01CD\u0200\u0202\u1EA0\u1EAC\u1EB6\u1E00\u0104\u023A\u2C6F"'],
        ["base" => "AA", "chars" => '"\uA732"'],
        ["base" => "AE", "chars" => '"\u00C6\u01FC\u01E2"'],
        ["base" => "AO", "chars" => '"\uA734"'],
        ["base" => "AU", "chars" => '"\uA736"'],
        ["base" => "AV", "chars" => '"\uA738\uA73A"'],
        ["base" => "AY", "chars" => '"\uA73C"'],
        ["base" => "B", "chars" => '"\u24B7\uFF22\u1E02\u1E04\u1E06\u0243\u0181"'],
        ["base" => "C", "chars" => '"\u24b8\uff23\uA73E\u1E08\u0106\u0043\u0108\u010A\u010C\u00C7\u0187\u023B"'],
        ["base" => "D", "chars" => '"\u24B9\uFF24\u1E0A\u010E\u1E0C\u1E10\u1E12\u1E0E\u0110\u018A\u0189\u1D05\uA779"'],
        ["base" => "Dh", "chars" => '"\u00D0"'],
        ["base" => "DZ", "chars" => '"\u01F1\u01C4"'],
        ["base" => "Dz", "chars" => '"\u01F2\u01C5"'],
        ["base" => "E", "chars" => '"\u025B\u24BA\uFF25\u00C8\u00C9\u00CA\u1EC0\u1EBE\u1EC4\u1EC2\u1EBC\u0112\u1E14\u1E16\u0114\u0116\u00CB\u1EBA\u011A\u0204\u0206\u1EB8\u1EC6\u0228\u1E1C\u0118\u1E18\u1E1A\u0190\u018E\u1D07"'],
        ["base" => "F", "chars" => '"\uA77C\u24BB\uFF26\u1E1E\u0191\uA77B"'],
        ["base" => "G", "chars" => '"\u24BC\uFF27\u01F4\u011C\u1E20\u011E\u0120\u01E6\u0122\u01E4\u0193\uA7A0\uA77D\uA77E\u0262"'],
        ["base" => "H", "chars" => '"\u24BD\uFF28\u0124\u1E22\u1E26\u021E\u1E24\u1E28\u1E2A\u0126\u2C67\u2C75\uA78D"'],
        ["base" => "I", "chars" => '"\u24BE\uFF29\u00CC\u00CD\u00CE\u0128\u012A\u012C\u0130\u00CF\u1E2E\u1EC8\u01CF\u0208\u020A\u1ECA\u012E\u1E2C\u0197"'],
        ["base" => "J", "chars" => '"\u24BF\uFF2A\u0134\u0248\u0237"'],
        ["base" => "K", "chars" => '"\u24C0\uFF2B\u1E30\u01E8\u1E32\u0136\u1E34\u0198\u2C69\uA740\uA742\uA744\uA7A2"'],
        ["base" => "L", "chars" => '"\u24C1\uFF2C\u013F\u0139\u013D\u1E36\u1E38\u013B\u1E3C\u1E3A\u0141\u023D\u2C62\u2C60\uA748\uA746\uA780"'],
        ["base" => "LJ", "chars" => '"\u01C7"'],
        ["base" => "Lj", "chars" => '"\u01C8"'],
        ["base" => "M", "chars" => '"\u24C2\uFF2D\u1E3E\u1E40\u1E42\u2C6E\u019C\u03FB"'],
        ["base" => "N", "chars" => '"\uA7A4\u0220\u24C3\uFF2E\u01F8\u0143\u00D1\u1E44\u0147\u1E46\u0145\u1E4A\u1E48\u019D\uA790\u1D0E"'],
        ["base" => "NJ", "chars" => '"\u01CA"'],
        ["base" => "Nj", "chars" => '"\u01CB"'],
        ["base" => "O", "chars" => '"\u24C4\uFF2F\u00D2\u00D3\u00D4\u1ED2\u1ED0\u1ED6\u1ED4\u00D5\u1E4C\u022C\u1E4E\u014C\u1E50\u1E52\u014E\u022E\u0230\u00D6\u022A\u1ECE\u0150\u01D1\u020C\u020E\u01A0\u1EDC\u1EDA\u1EE0\u1EDE\u1EE2\u1ECC\u1ED8\u01EA\u01EC\u00D8\u01FE\u0186\u019F\uA74A\uA74C"'],
        ["base" => "OE", "chars" => '"\u0152"'],
        ["base" => "OI", "chars" => '"\u01A2"'],
        ["base" => "OO", "chars" => '"\uA74E"'],
        ["base" => "OU", "chars" => '"\u0222"'],
        ["base" => "P", "chars" => '"\u24C5\uFF30\u1E54\u1E56\u01A4\u2C63\uA750\uA752\uA754"'],
        ["base" => "Q", "chars" => '"\u24C6\uFF31\uA756\uA758\u024A"'],
        ["base" => "R", "chars" => '"\u24C7\uFF32\u0154\u1E58\u0158\u0210\u0212\u1E5A\u1E5C\u0156\u1E5E\u024C\u2C64\uA75A\uA7A6\uA782"'],
        ["base" => "S", "chars" => '"\u24C8\uFF33\u1E9E\u015A\u1E64\u015C\u1E60\u0160\u1E66\u1E62\u1E68\u0218\u015E\u2C7E\uA7A8\uA784"'],
        ["base" => "T", "chars" => '"\u24C9\uFF34\u1E6A\u0164\u1E6C\u021A\u0162\u1E70\u1E6E\u0166\u01AC\u01AE\u023E\uA786"'],
        ["base" => "Th", "chars" => '"\u00DE"'],
        ["base" => "TZ", "chars" => '"\uA728"'],
        ["base" => "U", "chars" => '"\u24CA\uFF35\u00D9\u00DA\u00DB\u0168\u1E78\u016A\u1E7A\u016C\u00DC\u01DB\u01D7\u01D5\u01D9\u1EE6\u016E\u0170\u01D3\u0214\u0216\u01AF\u1EEA\u1EE8\u1EEE\u1EEC\u1EF0\u1EE4\u1E72\u0172\u1E76\u1E74\u0244"'],
        ["base" => "V", "chars" => '"\u24CB\uFF36\u1E7C\u1E7E\u01B2\uA75E\u0245"'],
        ["base" => "VY", "chars" => '"\uA760"'],
        ["base" => "W", "chars" => '"\u24CC\uFF37\u1E80\u1E82\u0174\u1E86\u1E84\u1E88\u2C72"'],
        ["base" => "X", "chars" => '"\u24CD\uFF38\u1E8A\u1E8C"'],
        ["base" => "Y", "chars" => '"\u24CE\uFF39\u1EF2\u00DD\u0176\u1EF8\u0232\u1E8E\u0178\u1EF6\u1EF4\u01B3\u024E\u1EFE"'],
        ["base" => "Z", "chars" => '"\u24CF\uFF3A\u0179\u1E90\u017B\u017D\u1E92\u1E94\u01B5\u0224\u2C7F\u2C6B\uA762"'],
        ["base" => "a", "chars" => '"\u24D0\uFF41\u1E9A\u00E0\u00E1\u00E2\u1EA7\u1EA5\u1EAB\u1EA9\u00E3\u0101\u0103\u1EB1\u1EAF\u1EB5\u1EB3\u0227\u01E1\u00E4\u01DF\u1EA3\u00E5\u01FB\u01CE\u0201\u0203\u1EA1\u1EAD\u1EB7\u1E01\u0105\u2C65\u0250\u0251"'],
        ["base" => "aa", "chars" => '"\uA733"'],
        ["base" => "ae", "chars" => '"\u00E6\u01FD\u01E3"'],
        ["base" => "ao", "chars" => '"\uA735"'],
        ["base" => "au", "chars" => '"\uA737"'],
        ["base" => "av", "chars" => '"\uA739\uA73B"'],
        ["base" => "ay", "chars" => '"\uA73D"'],
        ["base" => "b", "chars" => '"\u24D1\uFF42\u1E03\u1E05\u1E07\u0180\u0183\u0253\u0182"'],
        ["base" => "c", "chars" => '"\uFF43\u24D2\u0107\u0109\u010B\u010D\u00E7\u1E09\u0188\u023C\uA73F\u2184"'],
        ["base" => "d", "chars" => '"\u24D3\uFF44\u1E0B\u010F\u1E0D\u1E11\u1E13\u1E0F\u0111\u018C\u0256\u0257\u018B\u13E7\u0501\uA7AA"'],
        ["base" => "dh", "chars" => '"\u00F0"'],
        ["base" => "dz", "chars" => '"\u01F3\u01C6"'],
        ["base" => "e", "chars" => '"\u24D4\uFF45\u00E8\u00E9\u00EA\u1EC1\u1EBF\u1EC5\u1EC3\u1EBD\u0113\u1E15\u1E17\u0115\u0117\u00EB\u1EBB\u011B\u0205\u0207\u1EB9\u1EC7\u0229\u1E1D\u0119\u1E19\u1E1B\u0247\u01DD"'],
        ["base" => "f", "chars" => '"\u24D5\uFF46\u1E1F\u0192"'],
        ["base" => "ff", "chars" => '"\uFB00"'],
        ["base" => "fi", "chars" => '"\uFB01"'],
        ["base" => "fl", "chars" => '"\uFB02"'],
        ["base" => "ffi", "chars" => '"\uFB03"'],
        ["base" => "ffl", "chars" => '"\uFB04"'],
        ["base" => "g", "chars" => '"\u24D6\uFF47\u01F5\u011D\u1E21\u011F\u0121\u01E7\u0123\u01E5\u0260\uA7A1\uA77F\u1D79"'],
        ["base" => "h", "chars" => '"\u24D7\uFF48\u0125\u1E23\u1E27\u021F\u1E25\u1E29\u1E2B\u1E96\u0127\u2C68\u2C76\u0265"'],
        ["base" => "hv", "chars" => '"\u0195"'],
        ["base" => "i", "chars" => '"\u24D8\uFF49\u00EC\u00ED\u00EE\u0129\u012B\u012D\u00EF\u1E2F\u1EC9\u01D0\u0209\u020B\u1ECB\u012F\u1E2D\u0268\u0131"'],
        ["base" => "j", "chars" => '"\u24D9\uFF4A\u0135\u01F0\u0249"'],
        ["base" => "k", "chars" => '"\u24DA\uFF4B\u1E31\u01E9\u1E33\u0137\u1E35\u0199\u2C6A\uA741\uA743\uA745\uA7A3"'],
        ["base" => "l", "chars" => '"\u24DB\uFF4C\u0140\u013A\u013E\u1E37\u1E39\u013C\u1E3D\u1E3B\u017F\u0142\u019A\u026B\u2C61\uA749\uA781\uA747\u026D"'],
        ["base" => "lj", "chars" => '"\u01C9"'],
        ["base" => "m", "chars" => '"\u24DC\uFF4D\u1E3F\u1E41\u1E43\u0271\u026F"'],
        ["base" => "n", "chars" => '"\u24DD\uFF4E\u01F9\u0144\u00F1\u1E45\u0148\u1E47\u0146\u1E4B\u1E49\u019E\u0272\u0149\uA791\uA7A5\u043B\u0509"'],
        ["base" => "nj", "chars" => '"\u01CC"'],
        ["base" => "o", "chars" => '"\u24DE\uFF4F\u00F2\u00F3\u00F4\u1ED3\u1ED1\u1ED7\u1ED5\u00F5\u1E4D\u022D\u1E4F\u014D\u1E51\u1E53\u014F\u022F\u0231\u00F6\u022B\u1ECF\u0151\u01D2\u020D\u020F\u01A1\u1EDD\u1EDB\u1EE1\u1EDF\u1EE3\u1ECD\u1ED9\u01EB\u01ED\u00F8\u01FF\uA74B\uA74D\u0275\u0254\u1D11"'],
        ["base" => "oe", "chars" => '"\u0153"'],
        ["base" => "oi", "chars" => '"\u01A3"'],
        ["base" => "oo", "chars" => '"\uA74F"'],
        ["base" => "ou", "chars" => '"\u0223"'],
        ["base" => "p", "chars" => '"\u24DF\uFF50\u1E55\u1E57\u01A5\u1D7D\uA751\uA753\uA755\u03C1"'],
        ["base" => "q", "chars" => '"\u24E0\uFF51\u024B\uA757\uA759"'],
        ["base" => "r", "chars" => '"\u24E1\uFF52\u0155\u1E59\u0159\u0211\u0213\u1E5B\u1E5D\u0157\u1E5F\u024D\u027D\uA75B\uA7A7\uA783"'],
        ["base" => "s", "chars" => '"\u24E2\uFF53\u015B\u1E65\u015D\u1E61\u0161\u1E67\u1E63\u1E69\u0219\u015F\u023F\uA7A9\uA785\u1E9B\u0282"'],
        ["base" => "ss", "chars" => '"\u00DF"'],
        ["base" => "t", "chars" => '"\u24E3\uFF54\u1E6B\u1E97\u0165\u1E6D\u021B\u0163\u1E71\u1E6F\u0167\u01AD\u0288\u2C66\uA787"'],
        ["base" => "th", "chars" => '"\u00FE"'],
        ["base" => "tz", "chars" => '"\uA729"'],
        ["base" => "u", "chars" => '"\u24E4\uFF55\u00F9\u00FA\u00FB\u0169\u1E79\u016B\u1E7B\u016D\u00FC\u01DC\u01D8\u01D6\u01DA\u1EE7\u016F\u0171\u01D4\u0215\u0217\u01B0\u1EEB\u1EE9\u1EEF\u1EED\u1EF1\u1EE5\u1E73\u0173\u1E77\u1E75\u0289"'],
        ["base" => "v", "chars" => '"\u24E5\uFF56\u1E7D\u1E7F\u028B\uA75F\u028C"'],
        ["base" => "vy", "chars" => '"\uA761"'],
        ["base" => "w", "chars" => '"\u24E6\uFF57\u1E81\u1E83\u0175\u1E87\u1E85\u1E98\u1E89\u2C73"'],
        ["base" => "x", "chars" => '"\u24E7\uFF58\u1E8B\u1E8D"'],
        ["base" => "y", "chars" => '"\u24E8\uFF59\u1EF3\u00FD\u0177\u1EF9\u0233\u1E8F\u00FF\u1EF7\u1E99\u1EF5\u01B4\u024F\u1EFF"'],
        ["base" => "z", "chars" => '"\u24E9\uFF5A\u017A\u1E91\u017C\u017E\u1E93\u1E95\u01B6\u0225\u0240\u2C6C\uA763"']
    ];

    protected $chars = [];

    protected $encoding;

    public function __construct($encoding = "")
    {
        if (!$encoding) $encoding = mb_internal_encoding();
        if (!$encoding) $encoding = 'UTF-8';
        /*
         * You could validate $encoding here against the encodings supported by the mb_* functions:
         * https://www.php.net/manual/en/mbstring.supported-encodings.php
         * I leave that out and let the mb_* functions raise an E_WARNING
         * (or throw a ValueError on PHP 8+) if an unsupported encoding is used.
         */
        $this->encoding = $encoding;

        // Build a char => base lookup table once, for faster replacement.
        foreach ($this->replacementList as $entry) {
            // "chars" holds a JSON string literal; decoding it expands the \uXXXX escapes.
            $charsString = json_decode($entry["chars"]);
            $charsCount = mb_strlen($charsString, $this->encoding);
            for ($i = 0; $i < $charsCount; $i++) {
                $char = mb_substr($charsString, $i, 1, $this->encoding);
                $this->chars[$char] = $entry["base"];
            }
        }

    }

    public function removeDiacritics($string)
    {

        $finalString = "";
        $charsCount = mb_strlen($string, $this->encoding);
        for($i = 0; $i < $charsCount; $i++){
            $char = mb_substr($string, $i, 1, $this->encoding);
            // Use ?? rather than empty(): empty("0") is true, so the "0" replacement would be skipped.
            $finalString .= $this->chars[$char] ?? $char;
        }
        return $finalString;
    }




}

Using the class

$phpDiacritics = new PHPDiacritics('UTF-8');

$test1 = "Athènes";
$test2 = "Gdańsk";
$test3 = "niño";

echo $phpDiacritics->removeDiacritics($test1) . "\n";
echo $phpDiacritics->removeDiacritics($test2) . "\n";
echo $phpDiacritics->removeDiacritics($test3) . "\n";

This outputs

Athenes
Gdansk
nino
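If you only need a handful of characters, the same lookup idea can be sketched with PHP's built-in strtr() instead of a character-by-character loop. The map below is a deliberately tiny, hypothetical subset of the list above, and removeDiacriticsWithMap is an illustrative name, not part of the class.

```php
<?php
// Hypothetical, deliberately tiny map; a real one would cover the full list above.
$map = [
    "\u{00E8}" => "e",  // è
    "\u{00E9}" => "e",  // é
    "\u{0144}" => "n",  // ń
    "\u{00F1}" => "n",  // ñ
    "\u{00DF}" => "ss", // ß
];

function removeDiacriticsWithMap(string $string, array $map): string
{
    // strtr() with an array replaces all keys in a single pass and
    // never re-scans text it has already replaced.
    return strtr($string, $map);
}

echo removeDiacriticsWithMap("Ath\u{00E8}nes", $map), "\n";     // Athenes
echo removeDiacriticsWithMap("Gda\u{0144}sk", $map), "\n";      // Gdansk
echo removeDiacriticsWithMap("ni\u{00F1}o", $map), "\n";        // nino
echo removeDiacriticsWithMap("l'\u{00E9}t\u{00E9}", $map), "\n"; // l'ete
```

Because only the mapped characters are touched, apostrophes that are really part of the sentence are left alone, and the result does not depend on which iconv implementation is installed.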

Accountant م answered Oct 21 '22 11:10