I need to download a lot of pages through proxies. What is the best practice for building a multi-threaded web crawler?
Is Parallel.For/ForEach good enough here, or is it better suited to heavy CPU-bound tasks?
What do you say about the following code?
var multyProxy = new MultyProxy();
multyProxy.LoadProxyList();

Task[] taskArray = new Task[1000];
for (int i = 0; i < taskArray.Length; i++)
{
    taskArray[i] = new Task(obj =>
        {
            multyProxy.GetPage((string)obj);
        },
        "http://google.com"
    );
    taskArray[i].Start();
}
Task.WaitAll(taskArray);
It's working horribly: it's very slow and I don't know why. This code also works badly:
System.Threading.Tasks.Parallel.For(0, 1000,
    new System.Threading.Tasks.ParallelOptions() { MaxDegreeOfParallelism = 30 },
    loop =>
    {
        multyProxy.GetPage("http://google.com");
    }
);
Well, I think I'm doing something wrong. When I start my script it uses only 2%-4% of the network.
You are basically using up CPU-bound threads for IO-bound tasks; even though you're parallelizing your operations, each one still ties up a ThreadPool thread, which is mainly intended for CPU-bound work.
You need to switch to an async pattern for downloading the data so that it uses IO completion ports instead; with WebRequest, that means the BeginGetResponse() and EndGetResponse() methods.
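As a minimal sketch of wrapping that Begin/End pair with Task.Factory.FromAsync (DownloadPage here is a hypothetical helper, not part of your code):

using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

static Task<string> DownloadPage(string url)
{
    var req = WebRequest.Create(url);
    // FromAsync wires the Begin/End pair to an IO completion port,
    // so no ThreadPool thread is blocked while the download is in flight.
    return Task.Factory.FromAsync<WebResponse>(
            req.BeginGetResponse, req.EndGetResponse, null)
        .ContinueWith(t =>
        {
            // Read the body once the response has arrived.
            using (var rsp = t.Result)
            using (var reader = new StreamReader(rsp.GetResponseStream()))
                return reader.ReadToEnd();
        });
}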
I would suggest looking at Reactive Extensions to do this, e.g.:
IEnumerable<string> urls = ... get your urls here ...;

var results = from url in urls.ToObservable()
              let req = WebRequest.Create(url)
              from rsp in Observable.FromAsyncPattern<WebResponse>(
                  req.BeginGetResponse, req.EndGetResponse)()
              select ExtractResponse(rsp);
where ExtractResponse probably just uses StreamReader.ReadToEnd to get the string result, if that's what you're after.
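For instance, one assumption of what ExtractResponse might look like (it needs System.IO and System.Net):

static string ExtractResponse(WebResponse rsp)
{
    // Read the whole response body as a string and release the connection.
    using (rsp)
    using (var reader = new StreamReader(rsp.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}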
You can also look at the .Retry operator, which lets you easily retry a few times if you hit connection issues, etc.
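For instance, a sketch that applies .Retry per request rather than to the whole query; Observable.Defer makes a fresh WebRequest on each attempt, since a request can't be reused after a failure (the count of 3 is just an example):

var results = from url in urls.ToObservable()
              from rsp in Observable.Defer(() =>
                  {
                      var req = WebRequest.Create(url);
                      return Observable.FromAsyncPattern<WebResponse>(
                          req.BeginGetResponse, req.EndGetResponse)();
                  })
                  .Retry(3) // up to 3 attempts per URL
              select ExtractResponse(rsp);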