 

Anyone have a good solution for scraping the HTML source of a page with content (in this case, HTML tables) generated with Javascript? [closed]

Does anyone have a good solution for scraping the HTML source of a page whose content (in this case, HTML tables) is generated with JavaScript?

An embarrassingly simple, though workable, solution using Crowbar:

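<?php
// Fetch the JavaScript-rendered HTML of a page through a local Crowbar service.
function get_html($url) // $url must be urlencode(d)
{
    $context = stream_context_create(array(
        'http' => array('timeout' => 120) // HTTP timeout in seconds
    ));
    // Second argument is use_include_path, so pass false rather than 0
    $response = file_get_contents('http://127.0.0.1:10000/?url=' . $url . '&delay=3000&view=browser', false, $context);
    if ($response === false) {
        return false; // Crowbar unreachable or the request timed out
    }
    // substr removes HTML from the Crowbar web service, returning only the $url HTML
    return substr($response, 730, -32);
}
?>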

The advantage of using Crowbar is that the tables will be rendered (and accessible) thanks to its headless Mozilla-based browser. Edit: I discovered that the problem with Crowbar was a conflicting app, not the server downtime, which was just a coincidence.
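Once the rendered HTML is in hand, the tables still have to be pulled out of it. The following is only a rough sketch of one way to do that with PHP's bundled DOM extension; `extract_table_rows` is a hypothetical helper name, not something from the post or from Crowbar, and it assumes the page's rows sit in ordinary `<tr>`/`<th>`/`<td>` markup.

```php
<?php
// Hypothetical helper (not part of the original post): convert every <table>
// in an HTML string into a nested array of rows and trimmed cell texts.
function extract_table_rows($html)
{
    // Nothing to parse (e.g. the fetch failed and returned false or '').
    if (!is_string($html) || $html === '') {
        return array();
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect libxml warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $tables = array();
    foreach ($xpath->query('//table') as $table) {
        $rows = array();
        // './/tr' also matches rows wrapped in <thead>/<tbody>
        // (and rows of nested tables, if there are any).
        foreach ($xpath->query('.//tr', $table) as $tr) {
            $cells = array();
            foreach ($xpath->query('./th|./td', $tr) as $cell) {
                $cells[] = trim($cell->textContent);
            }
            $rows[] = $cells;
        }
        $tables[] = $rows;
    }
    return $tables;
}

// Usage, assuming Crowbar is listening on 127.0.0.1:10000 as in get_html()
// and that http://example.com/page stands in for the real target:
// $tables = extract_table_rows(get_html(urlencode('http://example.com/page')));
```

Working on the parsed DOM rather than on string offsets keeps the extraction independent of how much wrapper markup surrounds the tables.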

phpwns asked Oct 14 '22 05:10



1 Answer

Well, Java provides some convenient solutions, like HtmlUnit, which interprets JavaScript correctly and, as a consequence, should make the generated HTML visible.

Riduidel answered Oct 20 '22 12:10
