How to decode such strange string to UTF-8? (PHP)

I have the string %u041E%u043B%u0435%u0433%20%u042F%u043A. How can I convert it to real UTF-8 or (better for me) to HTML entities?

Rella asked Dec 18 '22 00:12


1 Answer

That's JavaScript escape() format. It is similar to URL-encoding but not compatible. Using it at all is usually a mistake.

The best thing to do is to change the script that generates it, to use proper URL-encoding (encodeURIComponent()) instead. Then you can decode it with urldecode or any other normal URL-decoding function on the server side.
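As a minimal sketch of that route, assuming the client sent the value through encodeURIComponent() and everything is UTF-8: the sample value below is what encodeURIComponent('Олег Як') produces, and the variable names are illustrative only.

```php
<?php
// Assumed client side: encodeURIComponent('Олег Як') yields this string.
$encoded = '%D0%9E%D0%BB%D0%B5%D0%B3%20%D0%AF%D0%BA';

// rawurldecode() turns each %XX back into one byte and leaves '+' alone.
$name = rawurldecode($encoded);

// Escape only at output time; the stored value stays raw UTF-8.
$html = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
echo $html, "\n";
```

Values read from $_GET are already URL-decoded by PHP, so an explicit call like this is only needed for strings obtained some other way.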

If you absolutely must interchange data in this non-standard format, you'll have to write a custom decoder for it. Here's a quick hack leveraging the HTML character-reference-decoder:

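function jsunescape($s) {
    // %uXXXX is a UTF-16 code unit; rewrite it as a hex character reference.
    $s= preg_replace('/%u([0-9A-Fa-f]{4})/', '&#x$1;', $s);
    // %XX from escape() is a code point below U+0100, not a raw byte.
    $s= preg_replace('/%([0-9A-Fa-f]{2})/', '&#x$1;', $s);
    // ENT_QUOTES so that %27 (apostrophe) is decoded as well.
    return html_entity_decode($s, ENT_QUOTES, 'utf-8');
}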

This returns a raw UTF-8 byte string. If you really want it in HTML character references like &#x0420;&#x0443;... then leave off the html_entity_decode call. But normally you don't. Best to keep strings in raw format until they need to be escaped for final output, and best not to replace non-ASCII characters with character references at all unless you really need to.
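The character-reference hack has gaps. As far as I know, current PHP versions refuse to decode references to surrogate code units, so characters outside the BMP (which escape() emits as two %uXXXX units) do not survive it. Below is a sketch of a decoder that avoids the entity round-trip, assuming the iconv extension is available. The name js_unescape_utf8 is hypothetical and not part of the answer's code.

```php
<?php
// Hypothetical helper: decodes JavaScript escape() output to UTF-8.
function js_unescape_utf8(string $s): string
{
    $out = preg_replace_callback(
        '/((?:%u[0-9A-Fa-f]{4})+)|%([0-9A-Fa-f]{2})/',
        function (array $m): string {
            if ($m[1] !== '') {
                // A run of %uXXXX escapes is a sequence of UTF-16 code units;
                // converting the run as a whole keeps surrogate pairs together.
                $utf16 = pack('H*', str_replace('%u', '', $m[1]));
                $utf8 = @iconv('UTF-16BE', 'UTF-8', $utf16);
                // On malformed input (e.g. a lone surrogate) keep the escapes as-is.
                return $utf8 === false ? $m[0] : $utf8;
            }
            // %XX from escape() is a code point below U+0100, i.e. ISO-8859-1.
            return iconv('ISO-8859-1', 'UTF-8', chr(hexdec($m[2])));
        },
        $s
    );
    return $out ?? $s; // preg_replace_callback() returns null on a PCRE error
}

$raw = js_unescape_utf8('%u041E%u043B%u0435%u0433%20%u042F%u043A');
// Escape for HTML only at the point of output.
echo htmlspecialchars($raw, ENT_QUOTES, 'UTF-8'), "\n";
```

The string stays raw UTF-8 inside the application and is passed through htmlspecialchars() only when it is written into a page, which is the order recommended above.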

What if a string like this comes to me: '%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED'?

That's URL-form-encoded, which is not directly compatible with escape() format. Whilst URL-encoding's 2-digit byte escapes are different from the crazy escape-format 4-digit code-unit-escapes, the character + is ambiguous. It could mean a plus (if the string came from escape), or a space (if it came from a browser form submission). There is no way to tell which it is. This is another reason not to use escape().
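The ambiguity is easy to see with PHP's two built-in decoders, which differ only in how they treat the plus sign. The sample string is made up for illustration.

```php
<?php
$s = 'a+b%20c';

// Form decoding: '+' means a space.
var_dump(urldecode($s));     // string(5) "a b c"

// Plain percent-decoding: '+' is left as a literal plus.
var_dump(rawurldecode($s));  // string(5) "a+b c"
```

Which of the two is right depends entirely on where the string came from, and the string itself does not say.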

Apart from that, note how the above function reads two-digit escapes: each %XX becomes the code point U+00XX, which is what escape() means by it, not one byte of a multi-byte sequence. The two readings agree for ASCII, so escape()-format characters mixed with URL-encoded ASCII come out fine as raw UTF-8 bytes. URL-encoded non-ASCII bytes do not, whatever their charset.

This string in fact appears to be code page 1251 (Windows Cyrillic). Do you really want to handle all your strings in cp1251? If so, you would have to change the function so that it decodes the four-digit escapes into that charset and passes the two-digit escapes through as raw bytes. This is messy:

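function url_or_maybe_jsescape_decode($s, $charset, $isform) {
    if ($isform)
        $s= str_replace('+', ' ', $s);
    // Decode both escape forms in one pass so neither output is re-scanned.
    return preg_replace_callback('/%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})/', function ($m) use ($charset) {
        // %uXXXX: a Unicode code unit, converted into $charset via a character reference.
        if ($m[1] !== '')
            return html_entity_decode('&#x'.$m[1].';', ENT_QUOTES, $charset);
        // %XX: a raw byte that is already in $charset.
        return chr(hexdec($m[2]));
    }, $s);
}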

echo url_or_maybe_jsescape_decode('%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED', 'cp1251', TRUE);

I would strongly recommend:

  1. fixing the Flash file so that it uses proper encodeURIComponent and not escape, so you can use a standard URL-decoder instead of this ugly hack.

  2. using UTF-8 instead all the way through your application, so you can support languages other than just Russian, and you don't have to worry about the input encoding of submitted forms changing.
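If legacy cp1251 input cannot be avoided right away, one way to move towards the second recommendation is to convert it to UTF-8 once, at the point where it enters the application. This is a sketch only, assuming the iconv extension is available and knows the charset name 'CP1251'. The helper name form_value_to_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: form-decode a value whose bytes are in $fromCharset,
// then convert it to UTF-8 before the rest of the application sees it.
function form_value_to_utf8(string $encoded, string $fromCharset): string
{
    $bytes = urldecode($encoded); // '+' becomes a space, %XX becomes one byte
    $utf8 = iconv($fromCharset, 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new InvalidArgumentException("Input is not valid $fromCharset");
    }
    return $utf8;
}

echo form_value_to_utf8('%CE%EB%E5%E3+%DF%EA%F3%F8%EA%E8%ED', 'CP1251'), "\n";
```

Everything downstream can then assume UTF-8, and only this one function needs to know that cp1251 ever existed.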

(All encodings that are not UTF-8 suck, and that's a FACT proven by SCIENCE!)

bobince answered Feb 25 '23 21:02