 

How to perform a background load and scrape of a page with a XUL/Firefox extension

I want to scrape the user pages of SO to give the users of my toolbar updated information on their questions/answers/etc.

This means I need to do this in the background: parse the pages, extract the content, compare it with the last run, and then present the results either on the toolbar, on the status bar, or in a pop-up window of some kind. And all of this has to happen without interrupting the user, who may not even be on SO at the time.

I've searched quite thoroughly both on Google and on the Mozilla Wiki for some kind of hint. I've even gone to the extent of downloading a few other extensions that I think do the same. Unfortunately I haven't had the time to go through all of them, and the ones I've looked at all use data APIs (services, web services, XML), not HTML scraping.

Old question text

I'm looking for a nice place to learn how I can load a page inside a function called by the infamous setTimeout(), to do the screen scraping in the background.

My idea is to present the results of such scraping in a status bar extension, in case anything has changed since the last run.

Is there a hidden overlay or some other subterfuge?

Gustavo Carreno asked Dec 27 '08


1 Answer

In the case of XUL/Firefox, what you need is the nsIIOService interface, which you can get like this:

var mIOS = Components.classes["@mozilla.org/network/io-service;1"].
   getService(Components.interfaces.nsIIOService);

Then you need to create a channel, and open an asynchronous link:

var channel = mIOS.newChannel(urlToOpen, null, null);  // spec, origin charset, base URI
channel.asyncOpen(new StreamListener(), channel);

The key here is the StreamListener object:

var StreamListener = function() {
    return {
        QueryInterface: function(aIID) {
            if (aIID.equals(Components.interfaces.nsIStreamListener) ||
                aIID.equals(Components.interfaces.nsISupportsWeakReference) ||
                aIID.equals(Components.interfaces.nsISupports))
                return this;
            throw Components.results.NS_NOINTERFACE;
        },

        onStartRequest: function(aRequest, aContext)
           { return 0; },

        onStopRequest: function(aRequest, aContext, aStatusCode)
           { return 0; },

        onDataAvailable: function(aRequest, aContext, aStream, aOffset, aCount)
           { return 0; }
    };
};

You have to fill in the details in the onStartRequest, onStopRequest, onDataAvailable functions, but that should be enough to get you going. You can have a look at how I used this interface in my Firefox extension (it is called IdentFavIcon, and it can be found on the mozilla add-ons site).
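For reference, here is one way those three callbacks might be filled in. This is only a minimal sketch: it accumulates the response body into a string via nsIScriptableInputStream and hands it to a callback (onPageLoaded, a name invented here) once the request finishes:

var StreamListener = function(onPageLoaded) {
    var data = "";
    return {
        QueryInterface: function(aIID) {
            if (aIID.equals(Components.interfaces.nsIStreamListener) ||
                aIID.equals(Components.interfaces.nsISupportsWeakReference) ||
                aIID.equals(Components.interfaces.nsISupports))
                return this;
            throw Components.results.NS_NOINTERFACE;
        },

        onStartRequest: function(aRequest, aContext) {
            data = "";  // reset the buffer for this request
        },

        onDataAvailable: function(aRequest, aContext, aStream, aOffset, aCount) {
            // Wrap the raw input stream so it can be read as a string.
            var sis = Components.classes["@mozilla.org/scriptableinputstream;1"]
                .createInstance(Components.interfaces.nsIScriptableInputStream);
            sis.init(aStream);
            data += sis.read(aCount);
        },

        onStopRequest: function(aRequest, aContext, aStatusCode) {
            if (Components.isSuccessCode(aStatusCode))
                onPageLoaded(data);  // full HTML source of the page
        }
    };
};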

The part I'm uncertain about is how you can trigger this page request from time to time; setTimeout() should probably work, though.
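For that periodic part, here is a rough sketch of a polling loop using window.setTimeout(), assuming the code runs in a XUL overlay where window is available (POLL_INTERVAL_MS and checkUserPage are names invented here, and the listener is the callback-taking variant sketched above):

var POLL_INTERVAL_MS = 15 * 60 * 1000;  // e.g. poll every 15 minutes

function checkUserPage() {
    var channel = mIOS.newChannel(urlToOpen, null, null);
    channel.asyncOpen(StreamListener(function(html) {
        // parse html, diff it against the last run,
        // then update the toolbar/status bar here
    }), null);

    // Schedule the next run.
    window.setTimeout(checkUserPage, POLL_INTERVAL_MS);
}

window.setTimeout(checkUserPage, POLL_INTERVAL_MS);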

Edit:

  1. See the example here (section Downloading Images) for how to collect the downloaded data into a single variable; and
  2. See this page for how to convert an HTML source into a DOM tree (a rough sketch follows below).
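For point 2, a sketch of turning the downloaded source into a queryable DOM tree, assuming a Gecko version whose DOMParser accepts "text/html" (older versions had to load the source into a hidden iframe instead):

function parsePage(htmlSource) {
    var parser = new DOMParser();
    var doc = parser.parseFromString(htmlSource, "text/html");

    // The usual DOM API works from here on, e.g. pulling a
    // hypothetical reputation element out of an SO user page:
    var rep = doc.getElementById("rep-score");  // id invented for illustration
    return rep ? rep.textContent : null;
}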

HTH.

David Hanak answered Oct 31 '22