I have a feed taken from third-party sites, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.
If the same function is applied twice by mistake, or the wrong one is used, the output gets even uglier. This is what I want to fix.
How can I detect which function, if any, needs to be applied to a given string?
The content is actually returned as UTF-8, but some parts inside it are not.
So you can test whether the string contains a colon. If it does not, urldecode it. If the decoded string contains a colon, the original string was URL encoded. If it still does not, check whether decoding changed the string: if it did, the string may have been encoded twice, so urldecode it again; if it did not, it is not a valid URI.
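The colon test above could be sketched roughly like this. The function name decodeUri and the $maxRounds limit are made up for illustration, and treating "contains a colon" as "looks like a URI" is the heuristic from the text, not a real URI validation.

```php
<?php
// Hypothetical helper: repeatedly urldecode until a colon shows up.
// Returns the decoded string, or null if it never looks like a URI.
function decodeUri(string $input, int $maxRounds = 2): ?string
{
    $current = $input;
    for ($round = 0; $round <= $maxRounds; $round++) {
        // A colon (as in "http:") is taken as the sign of a decoded URI.
        if (strpos($current, ':') !== false) {
            return $current;
        }
        $decoded = urldecode($current);
        // Decoding changed nothing, so there is nothing left to try.
        if ($decoded === $current) {
            return null;
        }
        $current = $decoded;
    }
    return null;
}
```

Note that urldecode also turns "+" into a space; rawurldecode may be the better fit if the input can contain literal plus signs.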
If it's a single-byte UTF-8 character, then it is always of the form '0xxxxxxx', where 'x' is any binary digit. If it's a two-byte UTF-8 character, then it's always of the form '110xxxxx 10xxxxxx'.
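As a small illustration of those two bit patterns, here is a sketch that checks them with bit masks. The function name isOneOrTwoByteUtf8 is hypothetical, and it only covers the one- and two-byte forms described above, not three- or four-byte sequences.

```php
<?php
// Hypothetical helper: true if $bytes is exactly one character in the
// one-byte form 0xxxxxxx or the two-byte form 110xxxxx 10xxxxxx.
function isOneOrTwoByteUtf8(string $bytes): bool
{
    $len = strlen($bytes);
    if ($len === 1) {
        // Top bit must be 0.
        return (ord($bytes[0]) & 0x80) === 0x00;
    }
    if ($len === 2) {
        $lead = ord($bytes[0]);
        $cont = ord($bytes[1]);
        // Lead byte starts with 110, continuation byte starts with 10.
        return ($lead & 0xE0) === 0xC0 && ($cont & 0xC0) === 0x80;
    }
    return false;
}
```

This is a pure pattern check: the lead bytes 0xC0 and 0xC1 match the pattern but are not allowed in valid UTF-8 (they would be overlong encodings), so a real validator needs more than this.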
When you need to write a program that performs string manipulation, must be very fast, and will certainly never need exotic characters, UTF-8 may not be the best choice. In every other situation, UTF-8 should be the standard. UTF-8 works well in almost all recent software, even on Windows.
In order to convert a String into UTF-8 in Java, we use the getBytes() method. The getBytes() method encodes a String into a sequence of bytes and returns a byte array. In the getBytes(String charsetName) form, charsetName is the name of the charset used to encode the String into an array of bytes.
I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.
The most universal way I found to work well in every case was:
if (preg_match('!!u', $string)) {
    // This is UTF-8
} else {
    // Definitely not UTF-8
}
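Building on that check, a minimal sketch of a "convert only when needed" helper might look like the following. The name toUtf8 is made up, the mbstring extension is assumed to be available, and the big assumption is that anything that is not valid UTF-8 is ISO-8859-1 (which is what utf8_encode assumes as well).

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, convert everything else.
// Assumption: non-UTF-8 input is ISO-8859-1.
function toUtf8(string $s): string
{
    // With the /u modifier, preg_match returns false on invalid UTF-8
    // and 1 on valid UTF-8 (the empty pattern always matches).
    if (preg_match('!!u', $s) === 1) {
        return $s;
    }
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}
```

Because valid UTF-8 is returned unchanged, calling it twice on the same string should not double-encode it. If a single string mixes UTF-8 and non-UTF-8 parts, the check fails for the whole string, so it would probably have to be applied per field or per segment of the feed.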