I'm having a problem with UTF-8 string comparison that I really can't figure out, and it's starting to give me a headache. Please help me out.
Basically, I have this string from an XML document encoded in UTF-8: 'Mina Tidigare anställningar'
When I compare it with exactly the same string typed in by hand, 'Mina Tidigare anställningar' (also in UTF-8), the result is FALSE!
I have no idea why. It is so strange. Can someone help me out?
This seems somewhat relevant. To simplify, there are several ways to write the same text in Unicode (and therefore in UTF-8): for example, ř can be written as one character, ř (U+0159), or as two characters: r (U+0072) followed by the combining caron ˇ (U+030C).
Your best bet would be the Normalizer class (part of PHP's intl extension): normalize both strings to the same normalization form and compare the results.
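To make the idea concrete, here is a minimal sketch of normalize-then-compare. It is written in Python rather than PHP purely for illustration, using the standard `unicodedata` module; in PHP, `Normalizer::normalize` would play the same role. The helper name `same_text` is made up for this example.

```python
import unicodedata

composed = "\u0159"      # ř as one precomposed code point
decomposed = "r\u030c"   # r followed by COMBINING CARON

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: identical after normalization
```

Which form you pick (NFC or NFD) matters less than applying the same one to both sides.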
In one of the comments, you show these hex representations of the strings:
4d696e61205469646967617265 20   616e7374 c3a4   6c6c6e696e676172 // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
        ^^-----------------^^^^1         ^^^^^^2
Note the parts I marked; apparently there are two parts to this problem.
For the first, see this question on the meaning of the byte sequence "c2a0": it is the UTF-8 encoding of U+00A0, the no-break space. For some reason, your typing was translated to a non-breaking space where the XML file has a normal space (20). Note that there's a normal space in both cases after "Mina". I'm not sure what to do about that in PHP, other than replacing all whitespace with a normal space.
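The "replace all whitespace with a normal space" idea could look roughly like the sketch below. It is in Python for illustration only (not PHP), and `normalize_spaces` is a hypothetical helper name. It relies on `\s` matching Unicode whitespace, including U+00A0, for `str` patterns in Python 3.

```python
import re

def normalize_spaces(text: str) -> str:
    """Replace every Unicode whitespace character with a plain U+0020 space."""
    return re.sub(r"\s", " ", text)

typed = "Tidigare\u00a0anst"   # NO-BREAK SPACE, bytes c2 a0 in UTF-8
from_xml = "Tidigare anst"     # ordinary space, byte 20

print(typed == from_xml)                    # False
print(normalize_spaces(typed) == from_xml)  # True
```

Be aware that this also turns tabs and newlines into spaces, which may or may not be what you want.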
As to the second, that is the case I outlined above: c3a4 is ä (U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS", one character, two bytes), whereas 61 is a (U+0061 "LATIN SMALL LETTER A", one character, one byte) and cc88 is the combining umlaut ¨ (U+0308 "COMBINING DIAERESIS", one character, two bytes). So the typed string spells ä with two characters in three bytes. Here, the normalization library should be useful.
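As a rough check of the analysis, the two hex dumps can be decoded and inspected directly. This is a Python sketch for illustration (not PHP); `describe` and `canonical` are hypothetical helper names, and NFC is just one reasonable choice of normalization form.

```python
import re
import unicodedata

# The two byte sequences from the hex dumps, split at the parts that differ.
from_xml = bytes.fromhex(
    "4d696e61205469646967617265" "20" "616e7374" "c3a4" "6c6c6e696e676172"
).decode("utf-8")
typed = bytes.fromhex(
    "4d696e61205469646967617265" "c2a0" "616e7374" "61cc88" "6c6c6e696e676172"
).decode("utf-8")

def describe(text: str) -> list[str]:
    """List the non-ASCII code points in text with their Unicode names."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text if ord(ch) > 0x7F]

def canonical(text: str) -> str:
    """Normalize to NFC, then map every whitespace character to U+0020."""
    return re.sub(r"\s", " ", unicodedata.normalize("NFC", text))

print(describe(from_xml))  # ['U+00E4 LATIN SMALL LETTER A WITH DIAERESIS']
print(describe(typed))     # ['U+00A0 NO-BREAK SPACE', 'U+0308 COMBINING DIAERESIS']
print(from_xml == typed)                        # False
print(canonical(from_xml) == canonical(typed))  # True
```

Both fixes are needed here: NFC composes a + U+0308 into ä but leaves the no-break space alone, so the whitespace replacement is a separate step.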