Special ä ö characters break UTF-8 encoding

Tags: php, encoding, utf-8

A user on my site inputted special characters into a text field: ä ö

These apparently are not the same ä ö characters I can input from my keyboard because when I paste them into Programmer's Notepad, they split into two: a¨ o¨

On my site's server side I have a PHP script that identifies illegal special characters in user input and highlights them in an HTML error message with preg_replace.

The character splitting happens there too, so I get a normal letter a and o with a weird lone xCC character that breaks the UTF-8 string encoding, and the json_encode function fails as a result.

What would be the best way to handle these characters? Should I try to replace the special ä ö chars with the regular ones, or can I somehow catch the broken UTF-8 chars and remove or replace them?

265

asked Feb 28 '19 13:02

Corrodian

1 Answer

It's not that these characters have broken the encoding, it's just that Unicode is really complicated.

Commonly used accented letters have their own code points in the Unicode standard, in this case:

U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS"
U+00F6 "LATIN SMALL LETTER O WITH DIAERESIS"

However, to avoid encoding every possibility, particularly when multiple diacritics (accents) need to be placed on the same letter, Unicode includes "combining diacritics", such as:

U+0308 "COMBINING DIAERESIS"

When placed after the code point for a normal letter, these code points add a diacritic to it when displaying.
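
To make the difference concrete, here is a small sketch comparing the two forms byte by byte. It assumes PHP 7.0 or later (for the "\u{...}" escape) and the mbstring extension; the variable names are only for illustration.

```php
<?php
// Precomposed form: the single code point U+00E4.
$composed = "\u{00E4}";
// Decomposed form: plain "a" followed by U+0308 COMBINING DIAERESIS.
$decomposed = "a\u{0308}";

// Both display as "ä", but the underlying UTF-8 bytes differ.
echo bin2hex($composed), "\n";   // c3a4
echo bin2hex($decomposed), "\n"; // 61cc88

// As far as PHP is concerned, these are two different strings.
var_dump($composed === $decomposed); // bool(false)

// Both are valid UTF-8 as long as the byte sequences stay intact.
var_dump(mb_check_encoding($composed, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($decomposed, 'UTF-8')); // bool(true)
```

Note that 0xCC is the first of the two bytes that encode U+0308. If some byte-oriented operation (for instance a regular expression without the /u modifier) cuts the string between 0xCC and 0x88, the result is no longer valid UTF-8, which is probably where the lone xCC in the question comes from.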

As you've seen, this means there are two different ways to represent the same letter. To help with this, Unicode includes "normalization forms", defined in Unicode Standard Annex #15:

Normalization Form D (NFD): Canonical Decomposition
Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD): Compatibility Decomposition
Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition

Ignoring the "Compatibility" forms for now, we have two options:

Decomposition, which uses combining diacritics as often as possible
Composition, which uses specific code points as often as possible

So one possibility is to convert your input into NFC, which in PHP can be achieved with the Normalizer class in the intl extension.
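
A minimal sketch of that conversion, assuming the intl extension is loaded. The helper name normalizeInput() is made up for this example; only Normalizer::normalize() and Normalizer::FORM_C come from intl.

```php
<?php
// Hypothetical helper: returns the NFC form of the input, or null on failure.
function normalizeInput(string $input): ?string
{
    // Normalizer::normalize() returns false if an error occurred.
    $normalized = Normalizer::normalize($input, Normalizer::FORM_C);

    return $normalized === false ? null : $normalized;
}

// Decomposed "ä" and "ö": base letter plus U+0308 COMBINING DIAERESIS.
$fromUser = "a\u{0308}o\u{0308}";
$clean = normalizeInput($fromUser);

// After NFC the string uses the precomposed code points U+00E4 and U+00F6.
var_dump($clean === "\u{00E4}\u{00F6}"); // bool(true)
```

Running this once on every incoming string, before any validation or comparison, means the rest of the code only has to deal with one representation of each letter.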

However, not all combinations can be normalised to a form with no separate diacritics, so this doesn't solve all your problems. You'll also need to look at what characters exactly you want to allow, probably by matching Unicode character properties.
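
As a rough illustration of both points, the sketch below normalises first and then checks the result against Unicode character properties. The isAllowed() function and its whitelist (letters, combining marks, digits and space separators) are assumptions made for this example, not a recommendation for what your site should accept. It needs the intl and mbstring extensions.

```php
<?php
// Hypothetical whitelist check: letters, combining marks, numbers, spaces.
function isAllowed(string $input): bool
{
    $nfc = Normalizer::normalize($input, Normalizer::FORM_C);
    if ($nfc === false) {
        return false; // normalisation failed, treat the input as invalid
    }

    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so a multi-byte character is never split into separate bytes.
    return preg_match('/^[\p{L}\p{M}\p{N}\p{Zs}]+\z/u', $nfc) === 1;
}

// "q" with a diaeresis has no precomposed code point, so NFC keeps
// the combining mark as a separate code point.
$q = Normalizer::normalize("q\u{0308}", Normalizer::FORM_C);
var_dump(mb_strlen($q, 'UTF-8')); // int(2)

var_dump(isAllowed("q\u{0308}"));  // bool(true), \p{M} covers the mark
var_dump(isAllowed("a\u{0308}"));  // bool(true)
var_dump(isAllowed("<b>"));        // bool(false)
```

Including \p{M} in the character class is what lets letters like the "q" above through; leaving it out would reject every letter that has no precomposed form.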

You might also want to learn about "grapheme clusters" and use the relevant PHP functions. A "grapheme cluster", or just "grapheme", is what most readers will think of as "a character" - e.g. a letter with all its diacritics, or a full ideogram.
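
Here is a short sketch of how the three ways of counting differ, assuming the intl and mbstring extensions are available. The sample string is the decomposed "äö" from the question.

```php
<?php
// Decomposed "äö": each letter is a base letter plus U+0308.
$text = "a\u{0308}o\u{0308}";

var_dump(strlen($text));              // int(6): bytes
var_dump(mb_strlen($text, 'UTF-8'));  // int(4): code points
var_dump(grapheme_strlen($text));     // int(2): grapheme clusters

// grapheme_substr() works in whole clusters, so it does not cut
// between a letter and its combining mark.
$first = grapheme_substr($text, 0, 1);
var_dump($first === "a\u{0308}"); // bool(true)

// In a regular expression, \X matches one extended grapheme cluster
// (the /u modifier is required for UTF-8 input).
preg_match_all('/\X/u', $text, $matches);
var_dump(count($matches[0])); // int(2)
```

For something like highlighting a single offending character in an error message, matching with \X or cutting with grapheme_substr() keeps the letter and its diacritic together.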

162

answered Oct 01 '22 00:10

IMSoP
