 

Using WebKit for headless browsing

I am using WebKit-based tools to build a headless browser for crawling webpages (I need this because I would like to evaluate the JavaScript found on pages and fetch the final rendered page). So far I have implemented two different systems, both of which use WebKit as the backend, and both exhibit very poor performance:

  1. Using Google Chrome: I start Google Chrome and communicate with each tab over the WebSocket interface that Chrome exposes for remote debugging (debugging over the wire). This lets me control each tab, load a new page, and fetch the DOM of the page once it has finished loading.
  2. Using PhantomJS: PhantomJS uses WebKit to load pages and provides a headless browsing option. As shown in the PhantomJS examples, I use page.open to open a new URL and then fetch the DOM once the page has loaded by evaluating JavaScript on the page (a minimal sketch of this approach follows this list).
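
For reference, a minimal sketch of approach 2 is below. It is not my exact script: the URL argument is a placeholder, the 10-second cutoff mirrors the limit described further down, and page.settings.resourceTimeout assumes a reasonably recent PhantomJS release.

    // Minimal PhantomJS sketch: load one URL and print the rendered DOM.
    var webpage = require('webpage');
    var system = require('system');

    var targetUrl = system.args[1] || 'http://example.com/'; // placeholder URL
    var page = webpage.create();

    // Drop slow resources so one stuck request cannot hold the page open forever.
    page.settings.resourceTimeout = 10000; // milliseconds

    var finished = false;
    function done(ok, html) {
        if (finished) { return; }
        finished = true;
        console.log(ok ? html : 'FAILED: ' + targetUrl);
        phantom.exit(ok ? 0 : 1);
    }

    // Hard 10-second deadline for the whole page load.
    setTimeout(function () { done(false, null); }, 10000);

    page.open(targetUrl, function (status) {
        if (status !== 'success') { done(false, null); return; }
        // Evaluate inside the page context to read back the rendered DOM.
        var html = page.evaluate(function () {
            return document.documentElement.outerHTML;
        });
        done(true, html);
    });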

My goal is to crawl pages as fast as I can, and if a page does not load within 10 seconds, to declare it failed and move on. I understand that each page takes a while to load, so to increase the number of pages I load per second, I open many tabs in Chrome or start multiple parallel PhantomJS processes. The following is the performance I observe:

  1. If I open more than 20 tabs in Chrome / 20 PhantomJS instances, CPU usage rockets up.
  2. Due to the high CPU usage, many pages take more than 10 seconds to load, so the failure rate is high (~80% of page load requests fail).
  3. If I want to keep failures below 5% of total requests, I cannot load more than 1 URL per second.

After trying both WebKit-based systems, it feels like the performance bottleneck is the WebKit rendering engine itself, so I would like to understand from other users how many URLs per second I can expect to crawl. My hardware configuration is:

  1. Processor: Intel® Core™ i7-2635QM (1 processor, 4 cores)
  2. Graphics card: AMD Radeon HD 6490M (256MB)
  3. Memory: 4GB
  4. Network bandwidth: more than sufficient for the page-load rates I am observing, so it is not the limiting factor

The question I am trying to ask is: does anyone have experience using WebKit to crawl a random set of URLs (say, 10k URLs picked from the Twitter stream), and how many URLs can I reasonably expect to crawl per second?

Thanks

asked Jul 11 '12 by Md Sami


1 Answer

This question is actually more related to hardware than to software, but let me point you in some better directions anyway.

First, understand that each page load itself spawns multiple threads: the browser downloads the page, then spawns new download threads for elements on the page such as JavaScript files, CSS files and images. [Ref: http://blog.marcchung.com/2008/09/05/chromes-process-model-explained.html]

So depending on how a page is structured, you can end up with a fair number of threads running at the same time just for that one page. Add on top of that the fact that you are trying to do too many loads at once, and you have a problem.

The Stack Overflow thread Optimal number of threads per core gives further information on the situation you are experiencing: you are overloading your CPU.

Your processor has 4 physical cores and 8 logical cores. I would recommend spawning no more than 4 connections at a time, leaving the secondary logical cores to handle some of the threading. You may find that you need to reduce this number even further, but 4 is a good starting point. By rendering pages 4 at a time, instead of overloading your whole system trying to render 20, you will actually increase your overall speed, since you end up with far less cache swapping. Start by clocking your time against several easily timed sites, then try with fewer and with more connections; there will be a sweet spot. Note that the headless PhantomJS approach is likely to be better for you, as in headless mode it probably won't download the images (a plus). A concurrency-capped sketch follows below.
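
As a rough illustration of that advice, here is a hedged sketch of a single PhantomJS process that keeps at most 4 pages in flight, disables image loading explicitly, and applies the 10-second deadline from the question. The urls.txt input file and the constants are assumptions, not something from the original setup.

    // Sketch: crawl a list of URLs with at most 4 concurrent pages.
    var webpage = require('webpage');
    var fs = require('fs');

    var MAX_CONCURRENT = 4; // roughly one page render per physical core
    var urls = fs.read('urls.txt').split('\n').filter(function (u) {
        return u.trim().length > 0;
    });
    var active = 0;
    var index = 0;

    function crawlNext() {
        if (index >= urls.length && active === 0) {
            phantom.exit(0);
            return;
        }
        while (active < MAX_CONCURRENT && index < urls.length) {
            crawl(urls[index++].trim());
        }
    }

    function crawl(url) {
        active += 1;
        var page = webpage.create();
        page.settings.loadImages = false;      // skip images entirely
        page.settings.resourceTimeout = 10000; // drop slow resources after 10s

        var finished = false;
        function finish(ok) {
            if (finished) { return; }
            finished = true;
            console.log((ok ? 'OK   ' : 'FAIL ') + url);
            page.close();
            active -= 1;
            crawlNext();
        }

        setTimeout(function () { finish(false); }, 10000); // hard per-page deadline

        page.open(url, function (status) {
            finish(status === 'success');
        });
    }

    crawlNext();

Clocking this against a fixed URL list while varying MAX_CONCURRENT is a simple way to find the sweet spot mentioned above.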

Your best overall option here, though, is to do a partial page rendering yourself using the WebKit source at http://www.webkit.org/, since all you seem to need to render is the HTML and the JavaScript. This reduces your number of connections and allows you to control your threads with far greater efficiency. You could in that case create an event queue and spool all of your primary URLs into it. Spawn off 4 worker threads that all work off the queue; as they process a page and need to download further resources, they can add those downloads to the queue. Once all files for a particular URL are downloaded into memory (or to disk, if you are worried about RAM), you can add an item to the event queue to render the page and then parse it for whatever you need.

answered by Gimballon