Scrape a webpage and navigate by clicking buttons

I want to perform the following actions on the server side:

1) Scrape a webpage
2) Simulate a click on that page and then navigate to the new page.
3) Scrape the new page
4) Simulate some button clicks on the new page
5) Send the data back to the client as JSON or in a similar format

I am thinking of doing this with Node.js.

But I am confused as to which module I should use:
a) Zombie
b) Node.io
c) PhantomJS
d) JSDOM
e) Anything else

I have installed Node.io but am not able to run it from the command prompt.

PS: I am working on Windows Server 2008.

asked Aug 10 '13 09:08

user2129794

2 Answers

Zombie.js and Node.io run on JSDOM, so your options are JSDOM (or an equivalent wrapper), a headless browser (PhantomJS, SlimerJS), or Cheerio.

JSDOM is fairly slow because it has to recreate the DOM and CSSOM in Node.js.
PhantomJS and SlimerJS are proper headless browsers, so performance is acceptable and they are also very reliable.
Cheerio is a lightweight alternative to JSDOM. It doesn't recreate the entire page in Node.js (it only parses the HTML; no JavaScript is executed). Therefore you can't really click on buttons or links, but it is very fast for scraping webpages.

Given your requirements, I'd probably go with a headless browser. In particular, I'd choose CasperJS, a scripting utility that runs on top of PhantomJS or SlimerJS, because it has a nice and expressive API, it's fast and reliable (it doesn't need to reinvent the wheel on how to parse and render the DOM or CSS like JSDOM does), and it makes it very easy to interact with elements such as buttons and links.

Your workflow in CasperJS should look more or less like this:

    // create a CasperJS instance
    var casper = require('casper').create();

    casper.start();

    casper
      .then(function () {
        console.log("Start:");
      })
      .thenOpen("https://www.domain.com/page1")
      .then(function () {
        // scrape something
        this.echo(this.getHTML('h1#foobar'));
      })
      .thenClick("#button1")
      .then(function () {
        // scrape something else
        this.echo(this.getHTML('h2#foobar'));
      })
      .thenClick("#button2")
      .thenOpen("http://myserver.com", {
        method: "post",
        data: {
          my: 'data'
        }
      }, function () {
        this.echo("data sent back to the server");
      });

    casper.run();
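
The last step of the workflow posts the scraped data to a server. As a rough sketch of what the receiving side could look like in Node.js, the snippet below uses only the built-in http module. The names parseFormBody and createReceiver are hypothetical, and it assumes the data arrives as a URL-encoded form body (for example my=data); the parsing would need to change if your client sends JSON instead.

```javascript
const http = require('http');

// Parse a URL-encoded body such as "my=data" into a plain object.
// Assumption: the client sends application/x-www-form-urlencoded data.
function parseFormBody(body) {
  return Object.fromEntries(new URLSearchParams(body));
}

// Hypothetical receiver: collects the POST body and echoes it back as JSON.
function createReceiver() {
  return http.createServer((req, res) => {
    if (req.method !== 'POST') {
      res.writeHead(405, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: 'POST only' }));
      return;
    }

    let body = '';
    req.setEncoding('utf8');
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      const received = parseFormBody(body);
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ received }));
    });
  });
}
```

To try it, call createReceiver().listen(3000) (the port is an arbitrary choice) and point the final thenOpen call at that address.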

answered Sep 22 '22 15:09

danielepolencic

Short answer (in 2019): Use puppeteer

If you need a full (headless) browser, use puppeteer instead of PhantomJS, as it offers an up-to-date Chromium browser with a rich API to automate any browser crawling and scraping tasks. If you only want to parse an HTML document (without executing JavaScript inside the page), you should check out jsdom and cheerio.

Explanation

Tools like jsdom (or cheerio) let you extract information from an HTML document by parsing it. This is fast and works well as long as the website does not rely on JavaScript to render its content. It is very hard or even impossible to extract information this way from a website built on JavaScript. jsdom, for example, is able to execute scripts, but it runs them inside a sandbox in your Node.js environment, which can be very dangerous and may crash your application. To quote the docs:

However, this is also highly dangerous when dealing with untrusted content.

Therefore, to reliably crawl more complex websites, you need an actual browser. For years, the most popular solution for this task was PhantomJS. But in 2018, the development of PhantomJS was officially suspended. Thankfully, since April 2017 the Google Chrome team has made it possible to run the Chrome browser headlessly (announcement). This makes it possible to crawl websites using an up-to-date browser with full JavaScript support.

To control the browser, the library puppeteer, which is also maintained by Google developers, offers a rich API for use within the Node.js environment.

Code sample

The lines below show a simple example. It uses Promises and the async/await syntax to execute a number of tasks. First, the browser is started (puppeteer.launch) and a URL is opened (page.goto). After that, functions like page.$eval, page.evaluate and page.click are used to extract information and execute actions on the page. Finally, the browser is closed (browser.close).

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      await page.goto('https://example.com');

      // example: get innerHTML of an element
      const someContent = await page.$eval('#selector', el => el.innerHTML);

      // Use Promise.all to wait for two actions (navigation and click)
      await Promise.all([
        page.waitForNavigation(), // wait for navigation to happen
        page.click('a.some-link'), // click link to cause navigation
      ]);

      // another example, this time using the evaluate function to return innerText of body
      const moreContent = await page.evaluate(() => document.body.innerText);

      // click another button
      await page.click('#button');

      // close browser when we are done
      await browser.close();
    })();
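
One possible way to reuse these steps on a server is to wrap them in a function that returns plain data, which can then be serialized with JSON.stringify and sent back to the client. The sketch below is only an illustration: the name scrapeTwoPages is hypothetical, the selectors are the same placeholders as in the sample above, and it assumes that page is a Puppeteer Page object obtained from browser.newPage().

```javascript
// Sketch only: `page` is assumed to behave like a Puppeteer Page object.
// The selectors are placeholders and must be adapted to the real site.
async function scrapeTwoPages(page, startUrl) {
  await page.goto(startUrl);

  // Scrape the first page.
  const someContent = await page.$eval('#selector', (el) => el.innerHTML);

  // Start waiting for the navigation before clicking, so that a fast
  // navigation cannot finish before the wait has been registered.
  await Promise.all([
    page.waitForNavigation(),
    page.click('a.some-link'),
  ]);

  // Scrape the new page, then click a button on it.
  const moreContent = await page.evaluate(() => document.body.innerText);
  await page.click('#button');

  // Plain object, ready to be passed to JSON.stringify.
  return { someContent, moreContent };
}
```

Because the function only depends on the page object it receives, the caller stays in charge of launching and closing the browser, and the result can be handed to whatever sends the response, for example res.end(JSON.stringify(result)) in a Node.js HTTP handler.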

answered Sep 23 '22 15:09

Thomas Dondorf

Tags: node.js, web-scraping, phantomjs, jsdom, zombie.js