How would I go about downloading and executing (i.e., evaluating JavaScript and building the DOM) more than 1,000 XHTML documents per minute?
Some outlines/constraints:
I am not so concerned about downloading the pages. I expect that actually executing each page is the bottleneck. .NET has a built-in WebBrowser control, but I have no idea whether it would scale up on a single machine. Also, .NET is not an absolute requirement, but it would make integration around here easier.
I'd be grateful for any comments/pointers regarding:
Thank you in advance,
/David
Look at one of the headless browsers for .NET - they will be faster than the BrowserControl as they don't need to render a graphical view.
I don't know whether this will let you execute 1,000 pages per minute, but it should be much faster than the control.
Here is one.
Here is a blog post about using HtmlUnit as a headless browser.
And an SO question about headless browsers.
I have an application implemented in WinForms that processes ~7,800 URLs in approximately 5 minutes (it downloads each URL, parses the content, looks for specific pieces of data, and if it finds what it's looking for, does some additional processing on that page).
This application used to take between 26 and 30 minutes to run, but after I changed the code to use the TPL (Task Parallel Library in .NET v4.0) it finishes in just 5. The computer is a Dell T7500 workstation with dual quad-core Xeon processors (3 GHz), 24 GB of RAM, and Windows 7 Ultimate 64-bit edition.
I simply use WebClient, Stream, and StreamReader objects within a Parallel.ForEach() loop, and it's extremely fast.
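As a rough illustration of that pattern, here is a minimal sketch, not the actual application code. The class name `PageProcessor`, the `marker` string, and the `maxParallel` setting are all hypothetical, and `DownloadString` stands in for the Stream/StreamReader handling mentioned above. Note that this only downloads and scans the markup; it does not execute JavaScript or build a DOM.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;

public static class PageProcessor
{
    // Downloads one page as a string. WebClient instances do not support
    // concurrent operations, so each call creates its own.
    public static string Download(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Fetches every URL in parallel and returns the URLs whose content
    // contains the (hypothetical) marker text. The fetch delegate is a
    // parameter so the download step can be swapped out or stubbed.
    public static ConcurrentBag<string> FindMatches(
        IEnumerable<string> urls,
        Func<string, string> fetch,
        string marker,
        int maxParallel)
    {
        var matches = new ConcurrentBag<string>(); // safe to add to from many threads
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(urls, options, url =>
        {
            try
            {
                string html = fetch(url);
                if (html.Contains(marker))
                {
                    matches.Add(url);
                }
            }
            catch (WebException)
            {
                // Skip pages that fail to download; a real application
                // would probably log or retry here.
            }
        });

        return matches;
    }
}
```

A call might look like `PageProcessor.FindMatches(urls, PageProcessor.Download, "some text", 16)`. Because the work is mostly waiting on the network, a degree of parallelism above the core count can help. On the .NET Framework, `ServicePointManager.DefaultConnectionLimit` may also need raising, since by default it allows only two concurrent connections per host.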
It's probably not the exact solution you're looking for, but unlike most of the other postings I see here, this actually does "process 1,000 pages / minute" [and more].
Food for thought ...