
Scrape web pages in real time with Node.js

What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query is dispatched to several different sites, the results are scraped, and then returned to the client as they become available.

Let's assume that this script should just provide the results in JSON format, and we can process them either directly in the browser or in another web application.

A few starting points:

Using node.js and jquery to scrape websites

Anybody have any ideas?
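A kayak-style fan-out can be sketched in a few lines of plain Node.js. Everything below is hypothetical scaffolding: fetchSite() stands in for a real per-site scraper and the site names are placeholders. The point is the shape of the dispatch: every site is queried in parallel, and each site's JSON result is handed to the caller as soon as that site finishes, rather than after the slowest one.

```javascript
// Hypothetical per-site scraper: in a real app this would fetch and
// parse the remote page; here it just simulates variable latency.
function fetchSite(site, query) {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve({ site, query, results: [query + " result from " + site] });
    }, site.length * 10); // simulated network delay
  });
}

// Dispatch one query to every site in parallel and invoke onResult
// with a JSON string per site, as each one completes. The returned
// promise resolves once all sites have reported.
function search(query, sites, onResult) {
  return Promise.all(
    sites.map((site) =>
      fetchSite(site, query).then((data) => onResult(JSON.stringify(data)))
    )
  );
}

search("flights", ["siteA", "siteB"], (json) => console.log(json));
```

In a browser-facing app, onResult would typically write each JSON chunk to the response stream (or push it over a websocket) instead of logging it.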

Avishai asked Mar 06 '11 15:03


People also ask

Is NodeJS good for web scraping?

Web scraping is the process of extracting data from a website in an automated way, and Node.js can be used for web scraping. Even though other languages and frameworks are more popular for web scraping, Node.js can do the job well too.
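As a tiny illustration of the parsing half of the job, here is a hypothetical extractTitle() helper that pulls the <title> out of an HTML string with a regex. A real Node.js scraper would pair an HTTP client with a proper parser (cheerio, jsdom, etc.) rather than regexes, but the extraction step looks the same in spirit.

```javascript
// Illustrative only: extract the <title> text from an HTML string.
// Returns null when no title element is present.
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  return match ? match[1].trim() : null;
}

console.log(extractTitle("<html><head><title>Flight deals</title></head></html>"));
// "Flight deals"
```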

Is web scraping real time?

In general, real-time data scraping is the process through which software scrapes data from websites at almost the same time as changes occur there. This process requires a delicate approach: to get data almost at once, your software needs to request the web sources many times.
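That "request the web sources many times" pattern can be sketched as a simple poller. This is illustrative only: readSource is an assumed stand-in for fetching and parsing a live page, and change detection here is a plain equality check on the extracted value.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll readSource() at a fixed interval for maxPolls iterations,
// invoking onChange only when the scraped value actually differs
// from the previous observation.
async function pollForChanges(readSource, onChange, intervalMs, maxPolls) {
  let last;
  for (let i = 0; i < maxPolls; i++) {
    const value = await readSource(); // stand-in for an HTTP fetch + parse
    if (value !== last) {
      last = value;
      onChange(value);
    }
    await sleep(intervalMs);
  }
}
```

Real pollers also need backoff and politeness (rate limits, caching headers) so the target site isn't hammered.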


2 Answers

All aforementioned solutions presume running the scraper locally. This means you will be severely limited in performance (due to running them in sequence or in a limited set of threads). A better approach, imho, is to rely on an existing, albeit commercial, scraping grid.

Here is an example:

var bobik = new Bobik("YOUR_AUTH_TOKEN");
bobik.scrape({
  urls: ['amazon.com', 'zynga.com', 'http://finance.google.com/', 'http://shopping.yahoo.com'],
  queries: ["//th", "//img/@src", "return document.title", "return $('script').length", "#logo", ".logo"]
}, function (scraped_data) {
  if (!scraped_data) {
    console.log("Data is unavailable");
    return;
  }
  // scraped_data is keyed by URL; iterate the keys themselves rather
  // than for-in over the array, which would yield indices, not URLs.
  var scraped_urls = Object.keys(scraped_data);
  for (var i = 0; i < scraped_urls.length; i++)
    console.log("Results from " + scraped_urls[i] + ": " + scraped_data[scraped_urls[i]]);
});

Here, scraping is performed remotely and a callback is issued to your code only when results are ready (there is also an option to collect results as they become available).

You can download the Bobik client SDK from https://github.com/emirkin/bobik_javascript_sdk

Yevgeniy answered Sep 20 '22 09:09


Node.io seems to take the cake :-)

Avishai answered Sep 20 '22 09:09