
PHP Scrape Article Excerpt like Readability

I've seen this question, but it doesn't really satisfy what I'm looking for. Its answers either lift the text from the meta description tag or generate an excerpt from an article body you already have.

What I want to do is actually get the first few sentences of an article, like Readability does. What's the best method for this? HTML parsing? Here's what I'm currently using, but it is not very reliable.

function guessExcerpt($url) {
    $html = file_get_contents_curl($url);

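    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ suppresses warnings from malformed markup

    // Initialise so the return value is defined when no description meta tag exists
    $description = null;
    $metas = $doc->getElementsByTagName('meta');

    for ($i = 0; $i < $metas->length; $i++)
    {
        $meta = $metas->item($i);
        if ($meta->getAttribute('name') == 'description')
            $description = $meta->getAttribute('content');
    }

    return $description;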
}

function file_get_contents_curl($url) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}
Alfo asked Jul 30 '12 16:07



1 Answer

There is a PHP port of Readability: https://github.com/andreskrey/readability.php. Because it implements Readability's algorithm, its extraction results should be similar to Readability's.

require 'lib/Readability.inc.php';

$html = file_get_contents_curl($url);
$html_input_charset = 'utf-8'; // charset of the fetched page; utf-8 is the default

$Readability     = new Readability($html, $html_input_charset);
$ReadabilityData = $Readability->getContent();

$title   = $ReadabilityData['title'];
$content = $ReadabilityData['content'];

Then you can use the first few sentences of $content as the excerpt.
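
One way to take those sentences is sketched below. It assumes $content is an HTML string; excerptFromHtml is a hypothetical helper written for this answer, not part of any Readability port. The sentence split is deliberately naive (it breaks after ".", "!" or "?" followed by whitespace), so abbreviations such as "e.g." will be cut in the wrong place.

```php
<?php
// Hypothetical helper (not part of any Readability library): builds a
// plain-text excerpt from an HTML string such as $content.
function excerptFromHtml(string $contentHtml, int $maxSentences = 3): string
{
    // Add a space after block-level closing tags and <br>, so that text from
    // adjacent paragraphs does not run together once the tags are stripped.
    $spaced = preg_replace(
        '#</(?:p|div|h[1-6]|li|blockquote|tr)>|<br\s*/?>#i',
        '$0 ',
        $contentHtml
    ) ?? $contentHtml;

    // Strip tags, decode entities, and collapse runs of whitespace.
    $text = html_entity_decode(strip_tags($spaced), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    // Naive sentence split: break on whitespace that follows ., ! or ?
    $sentences = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [$text];

    return implode(' ', array_slice($sentences, 0, $maxSentences));
}
```

With the snippet above you would call it as $excerpt = excerptFromHtml($content, 3); and adjust the sentence count to taste.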

Muhammad Abrar answered Sep 21 '22 00:09
