How do I detect whether I have to apply UTF-8 decode or encode to a string?

I have a feed taken from third-party sites, and sometimes I have to apply utf8_decode and other times utf8_encode to get the desired visible output.

If the same function is applied twice by mistake, or the wrong one is used, the output gets even uglier. This is what I want to fix.

How can I detect which of the two, if any, has to be applied to a given string?

The content is actually served as UTF-8, but some parts inside it are not.

Pentium10 asked Dec 10 '10 10:12

1 Answer

I wouldn't rely on mb_detect_encoding(). I got some freaky false positives from it a while back.

The most universal way I found, and one that worked well in every case I tried, was:

    if (preg_match('!!u', $string)) {
        // This is UTF-8
    } else {
        // Definitely not UTF-8
    }
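As a minimal sketch of how this check could drive the conversion decision from the question: the helper below leaves valid UTF-8 alone and converts everything else. The name ensure_utf8() is hypothetical, and the fallback assumes that any non-UTF-8 part is ISO-8859-1 (the same assumption utf8_encode() makes); a feed using another legacy encoding would need a different source charset.

```php
<?php
// Hypothetical helper, not part of PHP: the name and the ISO-8859-1
// fallback are assumptions made for this sketch.
function ensure_utf8(string $string): string
{
    // With the /u modifier, preg_match() returns false when the subject
    // is not valid UTF-8; the empty pattern matches (returns 1) otherwise.
    if (preg_match('!!u', $string)) {
        return $string; // already valid UTF-8, leave it untouched
    }

    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    $converted = iconv('ISO-8859-1', 'UTF-8', $string);

    // iconv() returns false on failure; fall back to the input in that case.
    return $converted === false ? $string : $converted;
}

// Apply the check per item, so a badly encoded part is fixed on its own.
$items = ["caf\xC3\xA9", "caf\xE9"]; // "café" as UTF-8, then as ISO-8859-1
$clean = array_map('ensure_utf8', $items);
```

Because valid UTF-8 passes through unchanged, running the helper twice on the same string does not double-encode it. It cannot repair text that was already double-encoded upstream, and a legacy-encoded string whose bytes happen to form valid UTF-8 would be left as-is.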
bisko answered Oct 19 '22 01:10
