I want to scrape the user pages of SO to give the users of my toolbar updated information on their questions/answers/etc.
This means I need to do this in the background, parse the pages, extract the content, compare it with the last run, and then present the results either on the toolbar, on the status bar, or alternatively in a pop-up window of some kind. And all of this has to happen while the user goes about their business, without being interrupted or even needing to be on SO.
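The compare-with-the-last-run step can be sketched as a plain diff of two stat objects. This is only an illustration: `diffStats` is a hypothetical helper, and the field names (`questions`, `answers`) are assumptions about how the scraped numbers might be stored, not an actual SO page format.

```javascript
// Compare the previous run's stats with the current ones and report
// only the fields that changed. Field names are hypothetical examples.
function diffStats(lastRun, current) {
  var changes = {};
  for (var key in current) {
    if (current.hasOwnProperty(key) && current[key] !== lastRun[key]) {
      changes[key] = { from: lastRun[key], to: current[key] };
    }
  }
  return changes;
}

// Example: only "questions" changed, so only it appears in the result.
var changes = diffStats({ questions: 12, answers: 40 },
                        { questions: 13, answers: 40 });
// changes → { questions: { from: 12, to: 13 } }
```

The returned object is empty when nothing changed, which makes it easy to decide whether the status bar or pop-up needs updating at all.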
I've searched quite thoroughly both on Google and on the Mozilla Wiki for some kind of hint. I've even gone to the extent of downloading a few other extensions that I think do the same. Unfortunately I've not had the time to go through all of them, and the ones I've looked at all use data APIs (services, web services, XML), not HTML scraping.
Old question text
I'm looking for a nice place to learn how I can load a page inside a function called by the infamous setTimeout() to perform screen scraping in the background.
My idea is to present the results of such scraping in a status bar extension, in case anything changed from the last run.
Is there a hidden overlay or some other subterfuge?
In the case of XUL/Firefox, what you need is the nsIIOService interface, which you can get like this:
var mIOS = Components.classes["@mozilla.org/network/io-service;1"]
                     .getService(Components.interfaces.nsIIOService);
Then you need to create a channel, and open an asynchronous link:
var channel = mIOS.newChannel(urlToOpen, 0, null);
channel.asyncOpen(new StreamListener(), channel);
The key here is the StreamListener object:
var StreamListener = function() {
  return {
    QueryInterface: function(aIID) {
      if (aIID.equals(Components.interfaces.nsIStreamListener) ||
          aIID.equals(Components.interfaces.nsISupportsWeakReference) ||
          aIID.equals(Components.interfaces.nsISupports))
        return this;
      throw Components.results.NS_NOINTERFACE;
    },

    onStartRequest: function(aRequest, aContext)
    { return 0; },

    onStopRequest: function(aRequest, aContext, aStatusCode)
    { return 0; },

    onDataAvailable: function(aRequest, aContext, aStream, aOffset, aCount)
    { return 0; }
  };
};
You have to fill in the details in the onStartRequest, onStopRequest, and onDataAvailable functions, but that should be enough to get you going. You can have a look at how I used this interface in my Firefox extension (it is called IdentFavIcon, and it can be found on the Mozilla add-ons site).
The part which I'm uncertain about is how you can trigger this page request from time to time; setTimeout() should probably work, though.
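The periodic trigger can be sketched as a self-rescheduling setTimeout loop. `makePoller` is a hypothetical helper, and the timer argument is only there as a seam so the loop can be driven without real delays; by default it falls back to the ordinary setTimeout.

```javascript
// Minimal polling sketch: call fetchFn (e.g. the asyncOpen code above),
// then reschedule the next run after intervalMs milliseconds.
// timerFn is an optional stand-in for setTimeout (useful for testing).
function makePoller(fetchFn, intervalMs, timerFn) {
  var timer = timerFn || function (fn, ms) { setTimeout(fn, ms); };
  function tick() {
    fetchFn();                // kick off one scrape
    timer(tick, intervalMs);  // schedule the next run
  }
  return tick;                // call the returned function once to start
}

// e.g. makePoller(scrapeUserPage, 15 * 60 * 1000)();  // every 15 minutes
```

A self-rescheduling setTimeout is usually preferable to setInterval here, because the next run is only queued after the previous scrape has at least been started, so slow requests don't pile up.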
HTH.