
Node.js Saving a GET request's HTML response

I'm apparently a little newer to Javascript than I'd care to admit. I'm trying to pull a webpage using Node.js and save the contents as a variable, so I can parse it however I feel like.

In Python, I would do this:

from bs4 import BeautifulSoup # for parsing
import urllib

text = urllib.urlopen("http://www.myawesomepage.com/").read()

parse_my_awesome_html(text)

How would I do this in Node? I've gotten as far as:

var request = require("request");
request("http://www.myawesomepage.com/", function (error, response, body) {
    /*
     Something here that lets me access the text
     outside of the closure

     This doesn't work:
     this.text = body;
    */ 
})

jdotjdot asked Jul 07 '12 00:07


3 Answers

var request = require("request");

var parseMyAwesomeHtml = function(html) {
    //Have at it
};

request("http://www.myawesomepage.com/", function (error, response, body) {
    if (!error) {
        parseMyAwesomeHtml(body);
    } else {
        console.log(error);
    }
});

Edit: As Kishore noted, there are good libraries available for parsing the response, such as jsdom. If you run into Python/gyp build issues with jsdom on Windows, see cheerio (available on GitHub) as an alternative.

Steve McGuire answered Nov 17 '22 21:11


That request() call is asynchronous, so the response is only available inside the callback. You have to call your parse function from it:

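function parse_my_awesome_html(text) {
    // ...
}

request("http://www.myawesomepage.com/", function (error, response, body) {
    if (error) {
        // The request itself failed, so there is no body to parse
        return console.log(error);
    }
    parse_my_awesome_html(body);
});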

Get used to chaining callbacks; that's essentially how I/O works in JavaScript :)
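
If the nesting gets uncomfortable, one option is to wrap the callback-style call in a Promise. The following is only a minimal sketch, assuming a Node version with built-in Promise support. It uses Node's built-in http module rather than request so that it has no dependencies, and fetchHtml is a hypothetical helper name, not part of any library:

```javascript
var http = require("http");

// Hypothetical helper (not a library function): performs a GET and
// returns a Promise that resolves with the response body as a string.
function fetchHtml(url) {
    return new Promise(function (resolve, reject) {
        http.get(url, function (response) {
            if (response.statusCode !== 200) {
                response.resume(); // discard the body so the socket is released
                reject(new Error("Unexpected status code: " + response.statusCode));
                return;
            }

            // The body arrives in chunks; collect them and join at the end
            var chunks = [];
            response.on("data", function (chunk) {
                chunks.push(chunk);
            });
            response.on("end", function () {
                resolve(Buffer.concat(chunks).toString("utf8"));
            });
            response.on("error", reject);
        }).on("error", reject); // connection-level failures (DNS, refused, ...)
    });
}
```

Usage would look like fetchHtml("http://www.myawesomepage.com/").then(parse_my_awesome_html). Inside an async function, var text = await fetchHtml(url); reads much like the Python version, but the call is still asynchronous underneath, so the text is only available after the await or inside the then callback.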

Ricardo Tomasi answered Nov 17 '22 20:11


jsdom works well for this if you want to parse the response.

var request = require('request'),
    jsdom = require('jsdom');

request({ uri: 'http://www.myawesomepage.com/' }, function (error, response, body) {
  // Bail out on a network error or a non-200 response
  if (error || response.statusCode !== 200) {
    console.log('Error when contacting myawesomepage.com');
    return;
  }

  // Note: jsdom.env is the older jsdom API (removed in jsdom 10)
  jsdom.env({
    html: body,
    scripts: [
      'http://code.jquery.com/jquery-1.5.min.js'
    ]
  }, function (err, window) {
    var $ = window.jQuery;

    // jQuery is now loaded on the jsdom window created from 'body'
    console.log($('body').html());
  });
});

Also, if your page loads a lot of its content via JavaScript/AJAX, you might want to consider using PhantomJS.

Source: http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs/

Kishore answered Nov 17 '22 21:11