I want to create a program to crawl my websites and check them for HTTP errors, among other things. I want to do this with multiple threads that accept parameters such as the URL to crawl. While X threads are active, there should already be Y tasks queued up, waiting to be executed.
Now I want to know: what is the best strategy for this, ThreadPool, Tasks, Threads, or something else?
Here's an example that shows how to queue up a bunch of tasks but limit the number that are running concurrently. It uses a Queue to keep track of tasks that are ready to run and a Dictionary to keep track of tasks that are running. When a task finishes, it invokes a callback method to remove itself from the Dictionary. An async method launches queued tasks as space becomes available.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

namespace MinimalTaskDemo
{
    class Program
    {
        private static readonly Queue<Task> WaitingTasks = new Queue<Task>();
        private static readonly Dictionary<int, Task> RunningTasks = new Dictionary<int, Task>();
        public static int MaxRunningTasks = 100; // vary this to dynamically throttle launching new tasks

        static void Main(string[] args)
        {
            var tokenSource = new CancellationTokenSource();
            var token = tokenSource.Token;
            Worker.Done = new Worker.DoneDelegate(WorkerDone);
            for (int i = 0; i < 1000; i++) // queue some tasks
            {
                // task state (i) will be our key for RunningTasks
                WaitingTasks.Enqueue(new Task(id => new Worker().DoWork((int)id, token), i, token));
            }
            LaunchTasks();
            Console.ReadKey();
            if (RunningTasks.Count > 0)
            {
                lock (WaitingTasks) WaitingTasks.Clear();
                tokenSource.Cancel();
                Console.ReadKey();
            }
        }

        static async void LaunchTasks()
        {
            // keep checking until we're done
            while ((WaitingTasks.Count > 0) || (RunningTasks.Count > 0))
            {
                // launch tasks when there's room
                while ((WaitingTasks.Count > 0) && (RunningTasks.Count < MaxRunningTasks))
                {
                    Task task;
                    // Main may clear the queue concurrently, so take the next task under the lock
                    lock (WaitingTasks)
                    {
                        if (WaitingTasks.Count == 0) break;
                        task = WaitingTasks.Dequeue();
                    }
                    lock (RunningTasks) RunningTasks.Add((int)task.AsyncState, task);
                    task.Start();
                }
                UpdateConsole();
                await Task.Delay(300); // wait before checking again
            }
            UpdateConsole(); // all done
        }

        static void UpdateConsole()
        {
            Console.Write("\rwaiting: {0,3:##0} running: {1,3:##0} ", WaitingTasks.Count, RunningTasks.Count);
        }

        // callback from finished worker
        static void WorkerDone(int id)
        {
            lock (RunningTasks) RunningTasks.Remove(id);
        }
    }

    internal class Worker
    {
        public delegate void DoneDelegate(int taskId);
        public static DoneDelegate Done { private get; set; }
        private static readonly Random Rnd = new Random();

        public async void DoWork(object id, CancellationToken token)
        {
            for (int i = 0; i < Rnd.Next(20); i++)
            {
                if (token.IsCancellationRequested) break;
                await Task.Delay(100); // simulate work
            }
            Done((int)id);
        }
    }
}
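As an aside, on .NET 4.5 and later a SemaphoreSlim can express the same throttling with much less bookkeeping. Here is a minimal sketch of that alternative; RunThrottled and the limits (1000 queued, 100 concurrent) are illustrative values, not part of the code above:

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

internal class ThrottleDemo
{
    // at most 100 tasks do work at once; the rest wait inside the semaphore
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(100);

    private static async Task RunThrottled(int id, CancellationToken token)
    {
        await Gate.WaitAsync(token); // wait for a free slot
        try
        {
            await Task.Delay(100, token); // simulate work, like Worker.DoWork above
        }
        finally
        {
            Gate.Release(); // hand the slot to the next waiter
        }
    }

    private static void Main()
    {
        var cts = new CancellationTokenSource();
        // queue 1000 tasks; only 100 run concurrently
        var tasks = Enumerable.Range(0, 1000).Select(i => RunThrottled(i, cts.Token)).ToArray();
        Task.WhenAll(tasks).Wait();
    }
}

Each call waits for a free slot before doing its work, so the semaphore itself plays the role of the Queue and Dictionary in the sample above.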
I recommend using (asynchronous) Tasks for downloading the data and then processing it (on the thread pool). Instead of throttling tasks, I recommend you throttle the number of requests per target server. Good news: .NET already does this for you.
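On the .NET Framework that built-in limit comes from ServicePointManager, which defaults to two concurrent connections per host. If you want more parallelism per server, a minimal sketch of raising it (the value 10 is illustrative, not from this answer):

using System.Net;

// Raise the per-host connection cap; the .NET Framework default is 2.
// 10 is an illustrative value; tune it to what your target servers tolerate.
ServicePointManager.DefaultConnectionLimit = 10;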
This makes your code as simple as:
private static readonly HttpClient client = new HttpClient();

public async Task Crawl(string url)
{
    // download asynchronously, then parse on the thread pool
    var html = await client.GetStringAsync(url);
    var nextUrls = await Task.Run(() => ProcessHtml(html));
    var nextTasks = nextUrls.Select(nextUrl => Crawl(nextUrl));
    await Task.WhenAll(nextTasks);
}

private IEnumerable<string> ProcessHtml(string html)
{
    // return all urls in the html string (see the sketch below).
    throw new NotImplementedException();
}
which you can kick off with a simple:
await Crawl("http://example.org/");
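To make the sample compile end to end, ProcessHtml could be filled in with a crude regex over href attributes. This is an assumed implementation for illustration only; a real crawler would use an HTML parser such as HtmlAgilityPack:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

private IEnumerable<string> ProcessHtml(string html)
{
    // naive href extraction; fine for a sketch, not for arbitrary HTML
    return Regex.Matches(html, @"href\s*=\s*""(https?://[^""]+)""", RegexOptions.IgnoreCase)
                .Cast<Match>()
                .Select(m => m.Groups[1].Value)
                .Distinct();
}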