
How can I normalize / asciify Unicode characters in Google Sheets?

I'm trying to write a formula for Google Sheets which will convert Unicode characters with diacritics to their plain ASCII equivalents.

I see that Google uses RE2 in its "REGEXREPLACE" function. And I see that RE2 offers Unicode character classes.

I tried to write a formula (similar to this one):

REGEXREPLACE("público","(\pL)\pM*","$1")

But Sheets produces the following error:

Function REGEXREPLACE parameter 2 value "\pL" is not a valid regular expression.

I suppose I could write a formula consisting of a long set of nested SUBSTITUTE functions (Like this one), but that seems pretty awful.

Can anyone suggest a better way to normalize Unicode letters with diacritical/accent marks in a Google Sheets formula?

Kirkman14 asked Feb 25 '16 23:02



3 Answers

[[:^alpha:]] (the negated ASCII character class) works fine in a REGEXEXTRACT formula.

But =REGEXREPLACE("público","([[:alpha:]])[[:^alpha:]]","$1") gives "pblic" as a result: the accented character is simply dropped. So, I guess, the formula has no way of knowing which ASCII character should replace "ú".


Workaround

Let's take the word públicē; we need to replace two symbols in it. Put this word in cell A1, and this formula in cell B1:

=JOIN("",ArrayFormula(IFERROR(VLOOKUP(SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"),D:E,2,0),SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"))))

Then build a lookup table of replacements in range D:E:

     D    E
1    ú    u
2    ē    e
3   ...  ...

This formula is still ugly, but it is more useful because you control the replacements yourself: to handle more characters, just add rows to the table.
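For anyone who finds the formula hard to follow, here is a minimal sketch of the same idea in JavaScript: split the text into characters, look each one up, and fall back to the original character when there is no match. The names REPLACEMENTS and replaceFromTable are hypothetical, and the table only mirrors the two example rows of D:E above.

```javascript
// Hypothetical replacement table, mirroring the two example rows of D:E.
var REPLACEMENTS = {
  '\u00fa': 'u', // ú
  '\u0113': 'e'  // ē
};

// Replace each character found in the table; keep every other character as-is
// (the same role IFERROR plays in the formula).
function replaceFromTable(text, table) {
  return Array.from(String(text)).map(function (ch) {
    return Object.prototype.hasOwnProperty.call(table, ch) ? table[ch] : ch;
  }).join('');
}
```

With this table, replaceFromTable('públicē', REPLACEMENTS) should give 'publice', provided the input uses precomposed characters. Unlike VLOOKUP, the object lookup is case-sensitive, so 'Ú' would need its own entry.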


Or use JavaScript

I also found a good solution that works in Google Sheets.

Max Makhrov answered Oct 08 '22 05:10



This did it for me in Google Sheets, using Google Apps Script (GAS):

function normalizetext(text) {
    // Each character in `weird` maps to the character at the same index in `normalized`.
    var weird = 'öüóőúéáàűíÖÜÓŐÚÉÁÀŰÍçÇ!@£$%^&*()_+?/*."';
    var normalized = 'ouooueaauiOUOOUEAAUIcC                 ';
    var source = text.toString();
    var new_text = '';

    for (var i = 0; i < source.length; i++) {
        var ch = source.charAt(i);
        // indexOf compares literally; search() would treat ch as a regex,
        // so characters such as "(", "*" or "." would throw or match wrongly.
        var idoff = weird.indexOf(ch);
        new_text += (idoff === -1) ? ch : normalized.charAt(idoff);
    }

    return new_text;
}
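A hand-written list like this only covers the characters you put in it. As an alternative sketch (not part of the function above), you can let Unicode decomposition do the work, which is close to what the (\pL)\pM* regex in the question was aiming for. This assumes String.prototype.normalize is available, as it is in current V8-based JavaScript runtimes; the name stripDiacritics is hypothetical.

```javascript
// Decompose each character into base letter + combining marks (NFD),
// then delete the marks in the Combining Diacritical Marks block.
function stripDiacritics(text) {
  return String(text)
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '');
}
```

For example, stripDiacritics('público') returns 'publico'. Letters that have no decomposition (ø, ł, ß and similar) pass through unchanged, so a small lookup table may still be needed for those.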
JaimeJCandau answered Oct 08 '22 06:10



This answer doesn't require Google Apps Script, and it's still fast and relatively simple. It builds on Max's answer by providing a full lookup table, and it also allows for case-sensitive transliteration (normally VLOOKUP is NOT case-sensitive).

Here is a link to the Google Spreadsheet if you want to jump right into it. If you want to use your own sheet, you'll need to copy the TRANS_TABLE sheet into your Spreadsheet.

In the formula below, the source cell is A2, so you'd place the formula in any other column on row 2. REGEXREPLACE and SPLIT break the string in A2 into an array of characters. ARRAYFORMULA then applies the following to each character in the array: CODE converts the character to its decimal code, and VLOOKUP matches that number against the table on the TRANS_TABLE sheet and returns the character from the column given by the index value (in this case, the 3rd column). Because the match is made on the code rather than on the character itself, upper and lower case are kept apart. When all characters have been transliterated, JOIN puts the array back together into a single string. I provided examples with named ranges as well.

=iferror(
join(
  "",
  ARRAYFORMULA(
    vlookup(
      code(split(REGEXREPLACE($A2,"(.)", "$1;"),";",TRUE)),
      TRANS_TABLE!$A$5:$F,3
    )
  )
)
,)
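If it helps to see the steps outside of a formula, here is a rough JavaScript sketch of the same pipeline: split into characters, convert each to its code, look the code up, join. The rows in TRANS_ROWS are a made-up excerpt, not the contents of the TRANS_TABLE sheet, and buildIndex and transliterateByCode are hypothetical names. One deliberate difference: unknown characters are kept here rather than causing an error.

```javascript
// Hypothetical excerpt of a table: [decimal code, glyph, replacement].
var TRANS_ROWS = [
  [63, '?', ''],        // blank replacement removes the character
  [97, 'a', 'a'],
  [193, '\u00c1', 'A'], // Á
  [225, '\u00e1', 'a'], // á
  [241, '\u00f1', 'n']  // ñ
];

// Index the rows by code so each lookup is exact and case-sensitive.
function buildIndex(rows) {
  var index = new Map();
  rows.forEach(function (row) {
    index.set(row[0], row[2]);
  });
  return index;
}

// Split into characters, look each code up, then join the results.
function transliterateByCode(text, index) {
  return Array.from(String(text)).map(function (ch) {
    var code = ch.codePointAt(0);
    return index.has(code) ? index.get(code) : ch;
  }).join('');
}
```

Keying on the numeric code is what keeps Á and á apart, which a plain text VLOOKUP would not do.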

On the TRANS_TABLE sheet I created 4 different transliteration columns, so you can keep one column for each of your transliteration needs. To use a different column, just change the index number in the VLOOKUP. Each column is simply a column of replacement characters. Where you don't want any conversion made (A -> A or 3 -> 3), copy the same character from the source Glyph column. Where you DO want to convert a character, type in whatever should replace it (ñ -> n etc). If you want a character removed altogether, leave the cell blank (? -> ''). The data sheet shows example output in 4 columns (A-D), each referencing one of the transliteration columns on the TRANS_TABLE sheet for a different use case.
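As a small illustration of the multi-column idea, the sketch below keeps two replacement columns per glyph and picks one by index, the way the VLOOKUP index does. The rows and the names SCHEME_ROWS and transliterateWith are invented for the example and are not taken from the TRANS_TABLE sheet.

```javascript
// Hypothetical rows: [glyph, scheme 1, scheme 2]. Values are illustrative only.
var SCHEME_ROWS = [
  ['\u00f1', 'n', 'ny'], // ñ: converted differently by each scheme
  ['?', '?', ''],        // scheme 1 keeps it, scheme 2 removes it (blank cell)
  ['3', '3', '3']        // no conversion in either scheme
];

// `column` selects the replacement column, like the index argument of VLOOKUP.
function transliterateWith(text, rows, column) {
  return Array.from(String(text)).map(function (ch) {
    for (var i = 0; i < rows.length; i++) {
      if (rows[i][0] === ch) {
        return rows[i][column];
      }
    }
    return ch; // characters missing from the table are left alone
  }).join('');
}
```

Calling transliterateWith('año?', SCHEME_ROWS, 1) and then with column 2 shows the two schemes side by side without touching the lookup code.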

I hope this finally answers your question in a fashion that isn't so "ugly." Cheers.

Doomd answered Oct 08 '22 04:10
