What's a good way to scrape website content using Node.js? I'd like to build something very, very fast that can execute searches in the style of kayak.com, where one query is dispatched to several different sites, the results are scraped, and they are returned to the client as they become available.
Let's assume that this script should just provide the results in JSON format, and we can process them either directly in the browser or in another web application.
A few starting points:
Using node.js and jquery to scrape websites
Anybody have any ideas?
Web scraping is the automated extraction of data from websites, and Node.js can be used for it. Although other languages and frameworks are more popular for web scraping, Node.js handles the job well too.
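As a rough illustration of what "extracting data" can mean in Node.js, here is a minimal, dependency-free sketch that pulls the title and link targets out of an HTML string. The helper name `extractPageData` and the sample markup are made up for this example, and regular expressions are used only to keep it short; for real pages a proper HTML parser is usually the safer choice.

```javascript
// Hypothetical helper: pull the <title> text and all <a href> values out of
// an HTML string. Regexes keep the sketch dependency-free; they are not a
// robust way to parse arbitrary HTML.
function extractPageData(html) {
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  const links = [];
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return {
    title: titleMatch ? titleMatch[1].trim() : null,
    links: links,
  };
}

// Made-up sample page standing in for a downloaded document.
const sampleHtml =
  '<html><head><title> Cheap flights </title></head>' +
  '<body><a href="/deal/1">One</a> <a class="x" href="/deal/2">Two</a></body></html>';

console.log(JSON.stringify(extractPageData(sampleHtml)));
// prints {"title":"Cheap flights","links":["/deal/1","/deal/2"]}
```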
In general, real-time data scraping means that software scrapes data from websites at almost the same time as changes occur there. This requires a delicate approach: to get data almost at once, your software has to request the web sources repeatedly, and requests that are too frequent can overload the target site or get your scraper blocked.
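The repeated-request idea can be sketched as a small poller that only reports content when it actually changes. Everything here is an assumption for illustration: `createChangeDetector` and `pollForChanges` are hypothetical names, and `fetchContent` stands for whatever async function downloads the page in your setup.

```javascript
const { createHash } = require('crypto');

// Hypothetical helper: returns a function that reports whether the content
// differs from the previous call, by comparing SHA-256 digests.
// The very first call always reports a change.
function createChangeDetector() {
  let lastDigest = null;
  return function hasChanged(content) {
    const digest = createHash('sha256').update(content).digest('hex');
    const changed = digest !== lastDigest;
    lastDigest = digest;
    return changed;
  };
}

// Hypothetical poller: calls fetchContent() every intervalMs and invokes
// onChange only when the content differs from the last successful fetch.
// Returns a function that stops the polling.
function pollForChanges(fetchContent, intervalMs, onChange) {
  const hasChanged = createChangeDetector();
  let busy = false;
  const timer = setInterval(async () => {
    if (busy) return; // previous request still in flight, skip this tick
    busy = true;
    try {
      const content = await fetchContent();
      if (hasChanged(content)) onChange(content);
    } catch (err) {
      console.error('Poll failed:', err.message);
    } finally {
      busy = false;
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Calling `pollForChanges(fetchPage, 5000, handleUpdate)` would check every five seconds; the interval is the knob that trades freshness against load on the target site.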
All the aforementioned solutions presume running the scraper locally. This means you will be limited in performance by your own machine (for example, by running requests in sequence or on a limited set of threads). A better approach, IMHO, is to rely on an existing, albeit commercial, scraping grid.
Here is an example:
    var bobik = new Bobik("YOUR_AUTH_TOKEN");
    bobik.scrape({
        urls: ['amazon.com', 'zynga.com', 'http://finance.google.com/', 'http://shopping.yahoo.com'],
        // The queries below mix XPath, JavaScript snippets, and CSS selectors
        queries: ["//th", "//img/@src", "return document.title", "return $('script').length", "#logo", ".logo"]
    }, function (scraped_data) {
        if (!scraped_data) {
            console.log("Data is unavailable");
            return;
        }
        // Iterate over the URL keys themselves; for...in over an array would yield indices
        Object.keys(scraped_data).forEach(function (url) {
            console.log("Results from " + url + ": " + scraped_data[url]);
        });
    });
Here, scraping is performed remotely and a callback is issued to your code only when results are ready (there is also an option to collect results as they become available).
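For readers who want to see the "results as they become available" pattern itself, here is a minimal plain-Node sketch. It is not Bobik's API: `dispatchQuery`, the `scrapers` map, and the fake scrapers with artificial delays are all hypothetical stand-ins for functions that would fetch and parse real sites.

```javascript
// Hypothetical dispatcher: starts every scraper concurrently and reports each
// outcome through onResult as soon as that scraper settles.
// `scrapers` maps a site name to a function (query) => Promise of results.
function dispatchQuery(query, scrapers, onResult) {
  const tasks = Object.entries(scrapers).map(([site, scrape]) =>
    Promise.resolve()
      .then(() => scrape(query))
      .then(
        (results) => ({ site, ok: true, results }),
        (err) => ({ site, ok: false, error: String(err.message || err) })
      )
      .then((outcome) => {
        onResult(outcome); // fires in completion order, not in input order
        return outcome;
      })
  );
  // Resolves once every site has either answered or failed.
  return Promise.all(tasks);
}

// Stand-in scrapers with artificial delays; real ones would download pages.
const delay = (ms, value) =>
  new Promise((resolve) => setTimeout(resolve, ms, value));
const fakeScrapers = {
  slowSite: (q) => delay(30, [q + ' result from slowSite']),
  fastSite: (q) => delay(5, [q + ' result from fastSite']),
  brokenSite: () => Promise.reject(new Error('timeout')),
};

dispatchQuery('NYC to SFO', fakeScrapers, (outcome) => {
  console.log(JSON.stringify(outcome)); // one JSON line per site
});
```

A failing site is reported as an outcome with `ok: false` rather than rejecting the whole batch, so one slow or broken source does not hold back the others.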
You can download the Bobik client proxy SDK at https://github.com/emirkin/bobik_javascript_sdk
Node.io seems to take the cake :-)