
When web scraping with Node.js, can I run all JavaScripts on the page? (i.e., simulate a real browser?)

I'm trying to do some web scraping with Node.js. Using jsdom, it is easy to load the DOM and inject JavaScript into it. I want to go one step further: run all the JavaScript the page links to and then inspect the resulting DOM, including visual properties (height, width, etc.) of elements.

Thus far, I get NaN when I try to inspect the dimensions of DOM elements with jsdom.

Is this possible?

It strikes me that there are two distinct challenges:

  1. Running all the JS on the web page
  2. Getting Node to simulate the window/screen rendering in addition to just the DOM

Another way to ask the question: is it possible to use Node.js as a fully scriptable headless browser?

If this isn't possible, does anyone have suggestions for what library I can use to do this? I'm relatively language agnostic.

Ted Benson asked Oct 20 '11 21:10


1 Answer

Take a look at PhantomJS. Incredibly simple to use.

http://www.phantomjs.org/

PhantomJS is a command-line tool that packs and embeds WebKit. Literally it acts like any other WebKit-based web browser, except that nothing gets displayed to the screen (thus, the term headless). In addition to that, PhantomJS can be controlled or scripted using its JavaScript API.
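
To make that concrete, here is a minimal sketch of a PhantomJS script that loads a page, lets its scripts run, and reads element dimensions from the rendered layout. The URL, the `h1` selector, the viewport size, and the helper name `collectDimensions` are placeholders I made up for illustration; `webpage`, `page.open`, `page.evaluate`, and `phantom.exit` are part of the PhantomJS API. It assumes the page has finished its work by the time the load callback fires, which may not hold for pages that keep rendering asynchronously.

```javascript
// Hypothetical helper: runs inside the loaded page, so it relies on the
// page's global `document` and must not close over outside variables.
function collectDimensions(selector) {
  var nodes = document.querySelectorAll(selector);
  return Array.prototype.map.call(nodes, function (el) {
    var rect = el.getBoundingClientRect();
    // Return plain values only, so the result can be passed back out of the page.
    return { tag: el.tagName, width: rect.width, height: rect.height };
  });
}

// Only drive the headless browser when run as `phantomjs dimensions.js`.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.viewportSize = { width: 1024, height: 768 }; // assumed viewport

  // Placeholder URL; replace with the page you want to scrape.
  page.open('http://example.com/', function (status) {
    if (status !== 'success') {
      console.log('Failed to load page');
      phantom.exit(1);
      return;
    }
    // Evaluate the helper in the page context, passing the selector as an argument.
    var sizes = page.evaluate(collectDimensions, 'h1');
    console.log(JSON.stringify(sizes, null, 2));
    phantom.exit();
  });
}
```

Because the page is actually laid out by WebKit, the width and height should come back as real numbers rather than the NaN you see with jsdom.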

Dal Hundal answered Nov 09 '22 02:11