
How can one scrape a web page with jQuery and XPath?

Using Firebug, I can inject a jQuery script tag into the head of a web page. Then I can run a script that scrapes that page and the pages it links to.

How do I begin writing this script in jQuery, or in JavaScript in general? Does either jQuery or plain JavaScript offer an interface that lets me use XPath to access the elements on a page (and on the pages it links to)?

dangerChihuahua007 asked Dec 22 '22 01:12


1 Answer

First, you'll need a JavaScript runtime outside of the browser. The most common is Node.js. Next, you'll need a way to build a browser-style DOM from the downloaded HTML, since Node.js does not provide one itself. This is typically done using jsdom.

So, your script should:

  1. download the HTML page (jsdom does this for you, but you can also use the request module)
  2. build a browser-style DOM from that HTML
  3. query the DOM using jQuery
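
On the XPath part of the question: for step 3, the standard DOM exposes XPath through document.evaluate, so jQuery is not the only option. The snippet below is a minimal sketch rather than a definitive implementation. The helper name xpathAll is made up for illustration, and it assumes the document object it receives implements the standard evaluate method, as browser documents do. Whether the document that jsdom creates supports evaluate depends on the jsdom version, so treat that as an assumption to verify.

```javascript
// Hypothetical helper: run an XPath expression against a DOM document and
// return the matching nodes as a plain array.
// Assumes `doc` implements the standard DOM `evaluate` method.
function xpathAll(doc, expression, contextNode) {
  // Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in the DOM spec.
  var ORDERED_NODE_SNAPSHOT_TYPE = 7;
  var result = doc.evaluate(
    expression,
    contextNode || doc, // node the expression is evaluated relative to
    null,               // no namespace resolver
    ORDERED_NODE_SNAPSHOT_TYPE,
    null                // do not reuse an existing result object
  );
  // Copy the snapshot into an array so it can be used like any other list.
  var nodes = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}
```

In the Firebug console this could be called as xpathAll(document, "//a[@href]"). Because jQuery accepts an array of DOM elements, $(xpathAll(document, "//a[@href]")) would wrap the matches so the usual jQuery methods can be used on them.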

Here is a sample Node.js script:

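// Uses the callback-style jsdom.env API found in older jsdom releases.
var jsdom = require("jsdom");

// Fetch the page, inject jQuery into it, then run the callback.
jsdom.env("http://nodejs.org/dist/", [
    'http://code.jquery.com/jquery-1.5.min.js'
  ], function(errors, window) {
  if (errors) {
    console.error("failed to load page:", errors);
    return;
  }
  // window.$ is the injected jQuery; count every link on the page.
  console.log("there have been", window.$("a").length, "nodejs releases!");
});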

Save it as scrape.js and run it like so:

$ node scrape.js

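Don't forget to install jsdom first. The script relies on jsdom.env, which only exists in older jsdom releases (it was dropped from the main module in version 10), so install a 9.x version:

$ npm install --production jsdom@9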
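
To move on to the pages a page links to, the href values collected from the DOM first have to be turned into absolute URLs. The snippet below is one possible sketch of that step, not part of jsdom or jQuery. The helper name resolveLinks is hypothetical, and it assumes a Node.js version that provides the global URL class.

```javascript
// Hypothetical helper: turn the raw href values found on a page into
// absolute, de-duplicated http(s) URLs that a crawler could fetch next.
function resolveLinks(pageUrl, hrefs) {
  var seen = {};
  var links = [];
  hrefs.forEach(function (href) {
    var url;
    try {
      url = new URL(href, pageUrl); // resolves relative paths against pageUrl
    } catch (e) {
      return; // skip hrefs that cannot be parsed as URLs
    }
    // Ignore mailto:, javascript: and other non-web schemes.
    if (url.protocol !== "http:" && url.protocol !== "https:") {
      return;
    }
    url.hash = ""; // "#section" variants point at the same document
    if (!seen[url.href]) {
      seen[url.href] = true;
      links.push(url.href);
    }
  });
  return links;
}

// Example input resembling what the links on a directory listing might contain.
console.log(resolveLinks("http://nodejs.org/dist/", ["v0.4.0/", "../about/", "#top", "mailto:a@b.c"]));
```

The hrefs themselves could be gathered with jQuery, for example by looping over window.$("a") and reading each element's href attribute, and each resolved URL could then be loaded the same way as the first page.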
JP Richardson answered Jan 01 '23 11:01