Scrape web pages in real time with Node.js

What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query is dispatched to several different sites, the results are scraped, and they are returned to the client as they become available.

Let's assume that this script should just provide the results in JSON format, and we can process them either directly in the browser or in another web application.

A few starting points:

Using node.js and jquery to scrape websites

Anybody have any ideas?

asked Mar 06 '11 15:03

Avishai

2 Answers

All aforementioned solutions presume running the scraper locally. This means you will be severely limited in performance (due to running them in sequence or in a limited set of threads). A better approach, imho, is to rely on an existing, albeit commercial, scraping grid.

Here is an example:

    var bobik = new Bobik("YOUR_AUTH_TOKEN");
    bobik.scrape({
      urls: ['amazon.com', 'zynga.com', 'http://finance.google.com/', 'http://shopping.yahoo.com'],
      queries: ["//th", "//img/@src", "return document.title", "return $('script').length", "#logo", ".logo"]
    }, function (scraped_data) {
      if (!scraped_data) {
        console.log("Data is unavailable");
        return;
      }
      // Log the scraped data for each URL, labelled with the URL itself.
      Object.keys(scraped_data).forEach(function (url) {
        console.log("Results from " + url + ": " + scraped_data[url]);
      });
    });

Here, scraping is performed remotely and a callback is issued to your code only when results are ready (there is also an option to collect results as they become available).
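The "results as they become available" pattern can also be sketched without any particular service. The following is only a minimal illustration in plain JavaScript, not Bobik's API: `searchAll`, `fakeSource`, and the site names are hypothetical, and the stand-in sources use timers instead of real HTTP requests. It dispatches one query to every source at once and reports each outcome as a JSON line as soon as that source finishes.

```javascript
// Hypothetical helper: send one query to several sources concurrently and
// report each source's outcome as soon as it settles.
// `sources` is an array of { name, search }, where search(query) returns a Promise.
function searchAll(sources, query, onResult) {
  const pending = sources.map(function (source) {
    // Starting from Promise.resolve() also captures synchronous throws in search().
    return Promise.resolve()
      .then(function () { return source.search(query); })
      .then(
        function (items) {
          return { source: source.name, ok: true, items: items };
        },
        function (err) {
          // A failing site becomes a result of its own and does not stop the others.
          return { source: source.name, ok: false, error: String((err && err.message) || err) };
        }
      )
      .then(function (result) {
        onResult(result); // called in completion order, not dispatch order
        return result;
      });
  });
  // Resolves once every source has reported; the array follows the order of `sources`.
  return Promise.all(pending);
}

// Stand-in source that simulates a site with a given latency (no network involved).
function fakeSource(name, delayMs, items, shouldFail) {
  return {
    name: name,
    search: function (query) {
      return new Promise(function (resolve, reject) {
        setTimeout(function () {
          if (shouldFail) {
            reject(new Error(name + " timed out"));
          } else {
            resolve(items.map(function (item) { return item + " for " + query; }));
          }
        }, delayMs);
      });
    }
  };
}

const demoSources = [
  fakeSource("slow-site", 60, ["fare A"], false),
  fakeSource("fast-site", 10, ["fare B", "fare C"], false),
  fakeSource("broken-site", 30, [], true)
];

// Each result is serialised as one JSON line the moment it arrives.
const demoDone = searchAll(demoSources, "SFO-JFK", function (result) {
  console.log(JSON.stringify(result));
});
```

In a real scraper, each `search` function would fetch and parse one site, and `onResult` would write the JSON line to the HTTP response so the client can render partial results while slower sites are still pending.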

You can download the Bobik client proxy SDK at https://github.com/emirkin/bobik_javascript_sdk

answered Sep 20 '22 09:09

Yevgeniy

Node.io seems to take the cake :-)

answered Sep 20 '22 09:09

Avishai

Tags: javascript, jquery, node.js, web-scraping, screen-scraping
