
Best multi-thread approach for multiple web requests

I want to write a program that crawls my websites and checks them for HTTP errors and other problems. I want to do this with multiple threads, each of which accepts parameters such as the URL to crawl. At most X threads should be active at any time, while Y tasks are already queued and waiting to be executed.

What is the best strategy for this: ThreadPool, Tasks, Threads, or something else?

maddo7 asked Dec 27 '22 06:12


2 Answers

Here's an example that shows how to queue up a bunch of tasks but limit the number that run concurrently. It uses a Queue to keep track of tasks that are ready to run and a Dictionary to keep track of tasks that are running. When a task finishes, it invokes a callback method to remove itself from the Dictionary. An async method launches queued tasks as space becomes available.

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

namespace MinimalTaskDemo
{
    class Program
    {
        private static readonly Queue<Task> WaitingTasks = new Queue<Task>();
        private static readonly Dictionary<int, Task> RunningTasks = new Dictionary<int, Task>();
        public static int MaxRunningTasks = 100; // vary this to dynamically throttle launching new tasks 

        static void Main(string[] args)
        {
            var tokenSource = new CancellationTokenSource();
            var token = tokenSource.Token;
            Worker.Done = new Worker.DoneDelegate(WorkerDone);
            for (int i = 0; i < 1000; i++)  // queue some tasks
            {
                // task state (i) will be our key for RunningTasks
                WaitingTasks.Enqueue(new Task(id => new Worker().DoWork((int)id, token), i, token));
            }
            LaunchTasks();
            Console.ReadKey();
            if (RunningTasks.Count > 0)
            {
                lock (WaitingTasks) WaitingTasks.Clear();
                tokenSource.Cancel();
                Console.ReadKey();
            }
        }

        static async void LaunchTasks()
        {
            // keep checking until we're done
            while ((WaitingTasks.Count > 0) || (RunningTasks.Count > 0))
            {
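                // launch tasks when there's room; take the same WaitingTasks lock that
                // Main uses, so a task is never dequeued or started while the queue is cleared
                lock (WaitingTasks)
                {
                    while (WaitingTasks.Count > 0)
                    {
                        lock (RunningTasks)
                        {
                            if (RunningTasks.Count >= MaxRunningTasks) break;
                            Task task = WaitingTasks.Dequeue();
                            RunningTasks.Add((int)task.AsyncState, task);
                            task.Start();
                        }
                    }
                }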
                UpdateConsole();
                await Task.Delay(300); // wait before checking again
            }
            UpdateConsole();    // all done
        }

        static void UpdateConsole()
        {
            Console.Write("\rwaiting: {0,3:##0}  running: {1,3:##0} ", WaitingTasks.Count, RunningTasks.Count);
        }

        // callback from finished worker
        static void WorkerDone(int id)
        {
            lock (RunningTasks) RunningTasks.Remove(id);
        }
    }

    internal class Worker
    {
        public delegate void DoneDelegate(int taskId);
        public static DoneDelegate Done { private get; set; }
        private static readonly Random Rnd = new Random();

        public async void DoWork(object id, CancellationToken token)
        {
            int steps;
            lock (Rnd) steps = Rnd.Next(20);    // Random is not thread-safe; pick the step count once
            for (int i = 0; i < steps; i++)
            {
                if (token.IsCancellationRequested) break;
                await Task.Delay(100);  // simulate work
            }
            Done((int)id);
        }
    }
}
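
The same idea (many queued jobs, a cap on how many run at once) can also be sketched with a `SemaphoreSlim`. This is a swapped-in technique, not the Queue/Dictionary approach above, and the names `Throttle` and `RunAsync` are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class Throttle
{
    // Runs job(item) for every item, with at most maxConcurrent jobs in flight.
    public static async Task RunAsync<T>(IEnumerable<T> items, int maxConcurrent, Func<T, Task> job)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = items.Select(async item =>
            {
                await gate.WaitAsync();          // wait for a free slot
                try { await job(item); }
                finally { gate.Release(); }      // free the slot even if the job throws
            }).ToList();                         // ToList starts every wrapper immediately

            await Task.WhenAll(tasks);           // completes once every job has finished
        }
    }
}
```

A call such as `await Throttle.RunAsync(urls, 100, url => CheckAsync(url));` would then process a list of URLs with at most 100 checks running at once, where `CheckAsync` stands in for whatever per-URL work the crawler does.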

Ed Power answered Jan 08 '23 02:01


I recommend using asynchronous Tasks to download the data and then processing it on the thread pool.

Instead of throttling tasks, I recommend you throttle the number of requests per target server. Good news: .NET already does this for you.

This makes your code as simple as:

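// requires: using System.Collections.Generic; using System.Linq;
//           using System.Net.Http; using System.Threading.Tasks;
private static readonly HttpClient client = new HttpClient();
public async Task Crawl(string url)
{
  var html = await client.GetStringAsync(url);                 // asynchronous download
  var nextUrls = await Task.Run(() => ProcessHtml(html));      // parse on the thread pool
  var nextTasks = nextUrls.Select(nextUrl => Crawl(nextUrl));  // crawl every discovered link
  await Task.WhenAll(nextTasks);
}
private IEnumerable<string> ProcessHtml(string html)
{
  // return all urls in the html string.
  return Enumerable.Empty<string>(); // placeholder so the sketch compiles
}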

which you can kick off with a simple:

await Crawl("http://example.org/");
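
As a rough sketch of where that per-server throttle lives: on .NET Framework the limit is governed by `ServicePointManager.DefaultConnectionLimit`, while `HttpClientHandler.MaxConnectionsPerServer` sets it explicitly for one client. The cap that applies by default depends on the runtime, so setting it yourself may be the safer choice. The `ThrottledClient` class, its method names, and the limit of 4 mentioned below are hypothetical, used only for illustration.

```csharp
using System;
using System.Net.Http;

static class ThrottledClient
{
    // Builds a handler that opens at most maxPerHost simultaneous
    // connections to any single server.
    public static HttpClientHandler CreateHandler(int maxPerHost)
    {
        return new HttpClientHandler { MaxConnectionsPerServer = maxPerHost };
    }

    // Builds an HttpClient on top of that handler; create it once and reuse it.
    public static HttpClient Create(int maxPerHost)
    {
        return new HttpClient(CreateHandler(maxPerHost));
    }
}
```

Replacing `new HttpClient()` in the snippet above with `ThrottledClient.Create(4)` would keep the crawler to four concurrent connections per target server, with further requests waiting until a connection frees up.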

Stephen Cleary answered Jan 08 '23 02:01