
Executing 1000+ pages/min in a browser environment

How would I go about downloading and executing (i.e., evaluating JavaScript and building the DOM) in excess of 1000 XHTML documents per minute?

Some outlines/constraints:

  • URLs to be downloaded are on different servers.
  • I need to traverse, and ideally modify, the resulting DOM.
  • No interest in rendering the graphics.
  • Bandwidth is not an issue.
  • Overly massive hardware parallelization would be more of a problem.
  • Production environment is .NET.

I am not so concerned about downloading the pages. I estimate that actually executing the pages is the bottleneck. .NET has a built-in WebBrowser control, but I have no idea whether it would scale on a single machine. Also, .NET is not an absolute requirement, but it would make integration around here easier.

I'd be grateful for any comments/pointers regarding:

  • Which browser API is most suited to do this?
  • Is a browser the right way to go? Maybe there's a more lightweight way to execute the JavaScript, which is the most important part (... but does not provide a DOM)?
  • What existing products/services, be they open source or commercial, may accomplish the task?
  • Roughly how many pages per minute can I expect to handle on a single machine (3ms Chrome rendering commercial, anyone)?
  • Any pitfalls one is likely to encounter...

Thank you in advance,

/David

OG Dude asked May 27 '26 05:05



2 Answers

Look at one of the headless browsers for .NET. They will be faster than the WebBrowser control because they don't need to render a graphical view.

I don't know if this will allow you to execute 1000 pages per minute, but it should be much faster than the control.

Here is one.

Here is a blog post about using HtmlUnit as a headless browser.

And an SO question about headless browsers.

Oded answered May 28 '26 17:05



I have an application implemented in WinForms that processes ~7,800 URLs in approximately 5 minutes: it downloads each URL, parses the content, looks for specific pieces of data and, if it finds what it's looking for, does some additional processing on that page.

This specific application used to take between 26 and 30 minutes to run, but after changing the code to use the TPL (Task Parallel Library in .NET v4.0) it executes in just 5. The computer is a Dell T7500 workstation with dual quad-core Xeon processors (3 GHz), 24 GB of RAM, and Windows 7 Ultimate 64-bit edition.

I simply use WebClient, Stream, and StreamReader objects within a Parallel.ForEach() loop, and it's extremely fast.
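For readers who want to see the shape of that loop, here is a minimal sketch of the same idea in Python rather than .NET. It is not the code from my application: the names `fetch`, `process_all`, `handle`, `fetcher`, and the `max_workers` value are all assumptions made for illustration. Like the WebClient approach, it only downloads and inspects the page text; it does not execute JavaScript or build a DOM.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as text (no JavaScript is run)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def process_all(urls, handle, fetcher=fetch, max_workers=32):
    """Fetch every URL on a thread pool and apply handle(url, body) to each.

    Returns a dict mapping each URL to handle's return value, or to the
    exception that was raised while fetching or handling that URL.
    """
    def work(url):
        try:
            return url, handle(url, fetcher(url))
        except Exception as exc:  # one bad server should not stop the batch
            return url, exc

    # Downloads are I/O-bound, so threads overlap the network waits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(work, urls))
```

A call such as `process_all(urls, lambda url, body: "needle" in body)` would report, per URL, whether the page text contains the string. The worker count is a guess and would need tuning against the target servers and the machine's cores.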

Probably not the exact solution you're looking for, but unlike most of the other postings I see here, this actually does "process 1,000 pages / minute" [and more]: ~7,800 URLs in 5 minutes works out to roughly 1,560 pages per minute.

Food for thought ...

BonanzaDriver answered May 28 '26 18:05



