Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PHP: Convert curl_exec output to UTF8

Tags:

php

unicode

utf-8

I would like to only work with UTF8. The problem is I don't know the charset of every webpage. How can I detect it and convert to UTF8?

<?php
$url = "http://vkontakte.ru";
$ch = curl_init($url);
$options = array(
    CURLOPT_RETURNTRANSFER => true,
);
curl_setopt_array($ch, $options);
$data = curl_exec($ch);

// $data = magic($data);

print $data;

See this at: http://paulisageek.com/tmp/curl-utf8

What is magic()?

like image 713
Paul Tarjan Avatar asked Mar 24 '10 19:03

Paul Tarjan


3 Answers

Going by Gumbo and Pekka's advice, I wrote curl_exec_utf8

/** The same as curl_exec except tries its best to convert the output to utf8 **/
function curl_exec_utf8($ch) {
    $data = curl_exec($ch);
    if (!is_string($data)) return $data;

    unset($charset);
    $content_type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    /* 1: HTTP Content-Type: header */
    preg_match( '@([\w/+]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
    if ( isset( $matches[3] ) )
        $charset = $matches[3];

    /* 2: <meta> element in the page */
    if (!isset($charset)) {
        preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s*charset=([^\s"]+))?@i', $data, $matches );
        if ( isset( $matches[3] ) ) {
            $charset = $matches[3];
            /* In case we want do do further processing downstream: */
            $data = preg_replace('@(<meta\s+http-equiv="Content-Type"\s+content="[\w/]+\s*;\s*charset=)([^\s"]+)@i', '$1utf-8', $data, 1);
        }
    }

    /* 3: <xml> element in the page */
    if (!isset($charset)) {
        preg_match( '@<\?xml.+encoding="([^\s"]+)@si', $data, $matches );
        if ( isset( $matches[1] ) ) {
            $charset = $matches[1];
            /* In case we want do do further processing downstream: */
            $data = preg_replace('@(<\?xml.+encoding=")([^\s"]+)@si', '$1utf-8', $data, 1);
        }
    }

    /* 4: PHP's heuristic detection */
    if (!isset($charset)) {
        $encoding = mb_detect_encoding($data);
        if ($encoding)
            $charset = $encoding;
    }

    /* 5: Default for HTML */
    if (!isset($charset)) {
        if (strstr($content_type, "text/html") === 0)
            $charset = "ISO 8859-1";
    }

    /* Convert it if it is anything but UTF-8 */
    /* You can change "UTF-8"  to "UTF-8//IGNORE" to 
       ignore conversion errors and still output something reasonable */
    if (isset($charset) && strtoupper($charset) != "UTF-8")
        $data = iconv($charset, 'UTF-8', $data);

    return $data;
}

The regexes are mostly from http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_content_type

like image 172
Paul Tarjan Avatar answered Oct 17 '22 17:10

Paul Tarjan


The converting is easy. The detecting is the hard part. You could try mb_detect_encoding but that is a very shaky method, it's literally "guessing" the content type and as @troelskn highlights in the comments can guess "rough" differences at best (Is it a multi-byte encoding?) but fails at detecting nuances of similar character sets.

The proper way would be IMO:

  • Interpreting any content-type Meta tags in the page
  • Interpreting any content-type headers sent by the server
  • If that yields nothing, try to "sniff" the encoding using mb_detect_encoding()
  • If that yields nothing, fall back to a defined default (maybe ISO-8859-1, maybe UTF-8).

Different than outlined in the guidelines in @Gumbo's answer, I personally think Meta tags should have priority over server headers because I'm pretty sure that if a Meta tag is present, that is a more reliable indicator of the actual encoding of the page than a server setting some site operators don't even know how to change. The correct way, however, seems to be to treat content-type headers with higher priority.

For the former, I think you can use get_meta_tags(). The latter you should be getting from curl already, you would just have to parse it. Here is a full example on how to systematically process response headers served by cURL.

The conversion would then be using iconv:

$new_content = iconv("incoming-charset", "utf-8", $content);
like image 23
Pekka Avatar answered Oct 17 '22 19:10

Pekka


I was extremely happy to find this answer, but noticed there's a flaw in the <meta> tag detection. It simply didn't seem to match any content-type tags, and it's not yet equipped for the new HTML5 style tags: <meta charset="UTF-8">. So I wrote this, hope it helps you guys, and thanks again for this excellent solution!

/* 2: <meta> element in the page */
if (!isset($charset)) {
    preg_match('/<[\s]*meta[^>]*charset="?([^\s"]+)\s?"/i', $data, $matches);

    if (isset($matches[1])) {
        $charset = $matches[1];
    }
}

(P.S. I couldn't figure out how to post this as a comment, as it's obviously not a full answer.)

like image 1
oscaralexander Avatar answered Oct 17 '22 19:10

oscaralexander