A user on my site inputted special characters into a text field: ä ö
These apparently are not the same ä ö characters I can input from my keyboard because when I paste them into Programmer's Notepad, they split into two: a¨ o¨
On my site's server side I have a PHP script that identifies illegal special characters in user input and highligts them in an html error message with preg_replace
.
The character splitting happens there too so I get a normal letter a and o with a weird lone xCC character that breaks the UTF-8 string encoding and json_encode
function fails as a result.
What would be the best way to handle these characters? Should I try to replace the special ä ö chars and replace them with the regular ones or can I somehow catch the broken UTF-8 chars and remove or replace them?
Yes. 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF are invalid UTF-8 code units.
UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes.
As for what encoding to use, Germans often use ISO/IEC 8859-15, but UTF-8 is increasingly becoming the norm, and can handle any kind of non-ASCII characters at the same time. UTF-8 is actually quite common in Germany now and can make all the difference when using German text.
It's not that these characters have broken the encoding, it's just that Unicode is really complicated.
Commonly used accented letters have their own code points in the Unicode standard, in this case:
However, to avoid encoding every possibility, particularly when multiple diacritics (accents) need to be placed on the same letter, Unicode includes "combining diacritics", such as:
When placed after the code point for a normal letter, these code points add a diacritic to it when displaying.
As you've seen, this means there's two different ways to represent the same letter. To help with this, Unicode includes "normalization forms" defined in an annex to the Unicode standard:
Ignoring the "Compatibility" forms for now, we have two options:
So one possibility is to convert your input into NFC, which in PHP can be achieved with the Normalizer
class in the intl
extension.
However, not all combinations can be normalised to a form with no separate diacritics, so this doesn't solve all your problems. You'll also need to look at what characters exactly you want to allow, probably by matching Unicode character properties.
You might also want to learn about "grapheme clusters" and use the relevant PHP functions. A "grapheme cluster", or just "grapheme", is what most readers will think of as "a character" - e.g. a letter with all its diacritics, or a full ideogram.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With