
Extracting JavaScript Variable Values via Web Scraping

For a company project, I need to create a web scraping application with PHP and JavaScript (including jQuery) that will extract specific data from each page of our clients' websites. The scraping app needs to get two types of data for each page: 1) determine whether certain HTML elements with specific IDs are present, and 2) extract the value of a specific JavaScript variable. The JS variable name is the same on each page, but the value is usually different.

I believe I know how to meet the first requirement: use the PHP file_get_contents() function to get each page's HTML, then use JavaScript/jQuery to parse that HTML and search for elements with specific IDs. However, I'm not sure how to get the second piece of data, the JavaScript variable values. The variable isn't even found within each page's HTML; instead, it is in an external JavaScript file that is linked to the page. And even if the JavaScript were embedded in the page's HTML, I know that file_get_contents() would only return the JavaScript source code (and other HTML), not the values the variables hold at runtime.

Can anyone suggest a good approach to getting this variable value for each page of a given website?

EDIT: Just to clarify, I need the values of the JavaScript variables after the JavaScript code has been run. Is such a thing even possible?

asked May 10 '11 at 14:05 by jake


2 Answers

You say you need the value of the variable after the JS has executed. I assume it's always the same JS, and only the initial variable values change from page to page. If so, your best bet is to port the JS logic to PHP: extract the initial variable values from the JS source, then compute in PHP what the script would have produced, as if you had executed the JS.

Here's a function for extracting variable values from JavaScript:


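/**
 * Extracts a variable's value given its name and type. Makes certain assumptions about the
 * source, e.g. it can't handle strings containing escaped quotes.
 *
 * @param string $jsText    the JavaScript source
 * @param string $name      the name of the variable
 * @param string $type      the variable type, either 'string' (default), 'float' or 'int'
 * @return string|int|float|false   the extracted value, or false if not found or the type is unknown
 */
function extractVar($jsText, $name, $type = 'string') {
    if (!in_array($type, array('string', 'float', 'int'), true)) {
        return false;
    }

    if ($type == 'string') {
        // Group 1 is the opening quote; \1 requires the same quote to close the string.
        $valueMatch = "(\"|')(.*?)\\1";
    } else {
        // Greedy on purpose: a lazy "+?" at the end of the pattern captures only one character.
        $valueMatch = "(-?[0-9.]+)";
    }

    // preg_quote() keeps names containing "$" or "." from being read as regex syntax.
    $pattern = '/' . preg_quote($name, '/') . '\s*=\s*' . $valueMatch . '/';
    if (!preg_match($pattern, $jsText, $matches)) {
        return false; // no assignment to $name found
    }

    if ($type == 'string') {
        return $matches[2];
    } else if ($type == 'float') {
        return (float)$matches[1];
    }
    return (int)$matches[1];
}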
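
Since the question says the variable lives in an external script file, the scraper first has to work out which script files a page links to before a function like the one above can be pointed at their source. Here is a minimal sketch of that step in JavaScript (the other language the question mentions). It assumes the src attributes are quoted and uses a regex rather than a real HTML parser; findScriptUrls, the sample HTML and the example.com URLs are all made up for illustration.

```javascript
// Hypothetical helper: list the absolute URLs of the external scripts a page references.
// A regex is used for brevity; it assumes quoted src attributes and is not a full HTML parser.
function findScriptUrls(html, pageUrl) {
  const urls = [];
  const scriptTag = /<script\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1[^>]*>/gi;
  let match;
  while ((match = scriptTag.exec(html)) !== null) {
    // Resolve relative paths such as "/js/config.js" against the page URL.
    urls.push(new URL(match[2], pageUrl).href);
  }
  return urls;
}

// Sample input, invented for this example.
const sampleHtml = `
  <html><head>
    <script src="/js/config.js"></script>
    <script type="text/javascript" src='https://cdn.example.com/lib.js'></script>
    <script>var inline = 1;</script>
  </head></html>`;

// Lists the two external scripts as absolute URLs; the inline script has no src and is skipped.
console.log(findScriptUrls(sampleHtml, "https://client.example.com/products/index.html"));
```

Each URL returned this way can then be downloaded and its text passed to the extraction function.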
answered Sep 22 '22 at 03:09 by mwhite


This seems so simple that presumably something rules it out, but if it's your own .js file you're trying to detect, why not have that script do something to the page that a scrape can detect?

Use the JS to populate a tag like this somewhere (via element.innerHTML, presumably):

<span><!--Important js thing has been activated!--></span>

Edit: alternatively, maybe use document.write, if the script needs to be detectable on load.
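
A rough sketch of how the marker idea could carry the variable's value, not just an "activated" flag. The marker format (a span with id "js-marker" and a data-value attribute) and the function names are hypothetical. It also assumes the scraper sees the page after the script has run; a raw HTML download will not contain markup that the script adds at runtime.

```javascript
// Hypothetical marker format: <span id="js-marker" data-value="..."></span>

// Escape a value so it can sit safely inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Reverse of escapeAttr; "&amp;" is restored last so other entities are not decoded twice.
function unescapeAttr(text) {
  return text
    .replace(/&quot;/g, '"')
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
}

// Page side: build the marker markup for the variable's current value.
// In a browser it could be added with:
//   document.body.insertAdjacentHTML("beforeend", buildMarker(myVar));
function buildMarker(value) {
  return `<span id="js-marker" data-value="${escapeAttr(value)}"></span>`;
}

// Scraper side: pull the value back out of the rendered HTML, or null if the marker is absent.
function readMarker(html) {
  const match = /<span id="js-marker" data-value="([^"]*)"><\/span>/.exec(html);
  return match ? unescapeAttr(match[1]) : null;
}

const renderedHtml = "<body><p>Hello</p>" + buildMarker('plan "gold" & <vip>') + "</body>";
console.log(readMarker(renderedHtml)); // prints: plan "gold" & <vip>
```

The value always comes back as a string, so numeric variables would need converting on the scraper side.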

answered Sep 20 '22 at 03:09 by Trass Vasston