How can we remove diacritic marks from strings in BigQuery using the new NORMALIZE
function? For example, turning:
café
into:
cafe
Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible ASCII representation.
We can remove accents from a string by using the Python module Unidecode. This module provides a method that takes a Unicode object or string and returns a string without accents.
Use java.text.Normalizer to handle this for you. It separates the accent marks from the base characters.
It's actually quite simple after you understand what normalize is doing:
WITH data AS(
SELECT 'Ãâíüçãõ' AS text
)
SELECT
REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') nfd_result,
REGEXP_REPLACE(NORMALIZE(text, NFKD), r'\pM', '') nfkd_result
FROM data
Results:
Row nfd_result nfkd_result
1 Aaiucao Aaiucao
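The same transformation can be checked outside BigQuery. As a sketch, Python's standard unicodedata module (my choice here; the post itself uses SQL) can decompose a string and drop the combining marks, mirroring the query above — the helper name strip_marks is illustrative:

```python
import unicodedata

def strip_marks(text: str, form: str) -> str:
    # Mirrors REGEXP_REPLACE(NORMALIZE(text, form), r'\pM', '') from the query:
    # decompose, then drop every combining mark (characters with a positive
    # combining class).
    return "".join(
        ch for ch in unicodedata.normalize(form, text)
        if not unicodedata.combining(ch)
    )

print(strip_marks("Ãâíüçãõ", "NFD"))   # Aaiucao
print(strip_marks("Ãâíüçãõ", "NFKD"))  # Aaiucao
```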
You can use either the "NFD" or the "NFKD" option and, for the most part, it should work (still, you should understand the differences between the two to better handle your data).
Basically, what normalize does is convert all code points in a string to their canonical equivalent (or compatibility form) so that we have an equivalent reference for comparisons (understanding this requires knowing a few concepts first).
The point is, Unicode not only establishes the mapping between numbers (code points, written with a U+ prefix) and their glyphs, but also rules for how these points interact with one another.
For instance, let's take the glyph á. There is not just one code point for this character: we can represent it either as U+00E1 or as the sequence U+0061 U+0301, which are the code points for a and ´.
Yep! Unicode is defined in such a way that you can combine base characters and diacritics, and represent their union simply by ordering one after the other.
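A quick way to see this is Python's standard unicodedata module (a stdlib analogue of BigQuery's NORMALIZE): the two representations of á are different strings of code points, yet normalization maps one onto the other:

```python
import unicodedata

composed = "\u00E1"      # á as a single precomposed code point
decomposed = "a\u0301"   # á as 'a' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False: different code points
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: composing unifies them
print(unicodedata.normalize("NFD", composed) == decomposed)  # True: so does decomposing
```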
In fact, you can play around with combining diacritics using an online Unicode converter.
Unicode distinguishes these characters with a clever and simple idea: characters that do not combine (base characters, including precomposed ones such as U+00E1) have combining class 0 (zero); code points that combine with a preceding base character receive a positive combining class (for instance, ´ has class 230), which is used to determine how the final glyph should be rendered.
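The combining classes above can be inspected directly with Python's standard unicodedata module, as a small sanity check of the claim:

```python
import unicodedata

# Base characters have combining class 0; combining marks have a positive class.
print(unicodedata.combining("a"))       # 0
print(unicodedata.combining("\u0301"))  # 230 (COMBINING ACUTE ACCENT)
print(unicodedata.combining("\u0327"))  # 202 (COMBINING CEDILLA)
```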
This is quite cool, but it creates a problem, and that problem explains the normalize function we've been discussing since the beginning: if we read two strings, one with the code points U+0061 U+0301 and the other with U+00E1 (both á), they should be considered equivalent! It's the same glyph represented in different ways.
This is precisely what normalize does. Unicode defines a canonical form for each character so that, after normalization, two strings with distinct code points for the same glyph compare as equal.
There are basically two ways to normalize code points: either composing several code points into one (in our example, transforming U+0061 U+0301 into U+00E1) or decomposing (the other way around, transforming U+00E1 into U+0061 U+0301).
Here you can see it more clearly:
NF means the canonical form: NFC retrieves the canonical composed (precomposed) character; NFD is the opposite, decomposing the character.
You can use this information to play around in BigQuery:
WITH data AS(
SELECT 'Amélie' AS text
)
SELECT
text,
TO_CODE_POINTS(NORMALIZE(text, NFC)) nfc_result,
TO_CODE_POINTS(NORMALIZE(text, NFD)) nfd_result
FROM data
Which results in:
Row text nfc_result nfd_result
1 Amélie [65, 109, 233, 108, 105, 101] [65, 109, 101, 769, 108, 105, 101]
Notice the nfd_result column has one more code point. By now you already know what that is: the ´ (769, i.e. U+0301) separated from the e.
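The extra code point can be verified with Python's standard unicodedata module, which plays the role of NORMALIZE and TO_CODE_POINTS here (a sketch, not the BigQuery engine itself):

```python
import unicodedata

s = "Am\u00e9lie"  # "Amélie" with a precomposed é (U+00E9)
nfc = [ord(c) for c in unicodedata.normalize("NFC", s)]
nfd = [ord(c) for c in unicodedata.normalize("NFD", s)]

print(nfc)  # [65, 109, 233, 108, 105, 101]
print(nfd)  # [65, 109, 101, 769, 108, 105, 101]  <- 769 is U+0301, the acute accent
```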
If you read BigQuery's documentation for NORMALIZE, you'll see it also supports the types NFKC and NFKD. These (with the letter K) do not normalize by canonical equivalence but rather by "compatibility": they break some characters into their constituent letters as well, not only diacritics:
The character ﬃ (U+FB03, which is not the same as the three separate letters ffi; this type of character is known as a ligature) is decomposed into the letters that constitute it. Canonical equivalence is lost, since ﬃ may not be the same as ffi for some applications, hence the name compatibility form.
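The difference between the canonical and compatibility forms shows up clearly with the ligature, again sketched with Python's standard unicodedata module:

```python
import unicodedata

lig = "\uFB03"  # the single ligature character ﬃ

print(unicodedata.normalize("NFD", lig) == lig)  # True: canonical forms keep the ligature
print(unicodedata.normalize("NFKD", lig))        # ffi: compatibility form splits it into 3 letters
```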
Now that we know how to decompose characters into the base glyph followed by its diacritic, we can use a regex to match only the marks and remove them from the string (which is accomplished by the expression \pM, which matches combining marks only):
WITH data AS(
SELECT 'café' AS text
)
SELECT
REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') nfd_result
FROM data
And that's pretty much all there is (hopefully) to the normalize function and how it's used to remove diacritics. I found all this information thanks to user sigpwned and his answer to this question. As I tried it and it didn't quite work at first, I decided to study some of the theory behind the methods and write it down :). Hopefully it'll be useful to more people, as it definitely was for me.