How to pass different instances while multithreading?

I am building a scraper. My goal is to start X browsers (where X is the number of threads) and have each of them scrape part of a list of URLs by splitting that list into X parts.

Say I decide to use 3 threads (3 browsers) with a list of 10 URLs.

Question: how can I split the work between the browsers like this:

  1. Browser1 scrapes items 0 to 3

  2. Browser2 scrapes items 4 to 7

  3. Browser3 scrapes items 8 to 9

All browsers should work at the same time, scraping the passed list of URLs.

I already have this BlockingCollection-based worker class:

BlockingCollection<Action> _taskQ = new BlockingCollection<Action>();

public Multithreading(int workerCount)
{
    // Create and start a separate Task for each consumer:
    for (int i = 0; i < workerCount; i++)
        Task.Factory.StartNew(Consume);
}

public void Dispose() { _taskQ.CompleteAdding(); }

public void EnqueueTask(Action action) { _taskQ.Add(action); }

void Consume()
{
    // This sequence blocks when no elements are available
    // and ends when CompleteAdding is called.
    foreach (Action action in _taskQ.GetConsumingEnumerable())
        action();     // Perform task.
}

public int ItemsCount()
{
    return _taskQ.Count;
}

It can be used like this:

Multithreading multithread = new Multithreading(3); // 3 threads
foreach (string url in urlList)
{
    multithread.EnqueueTask(new Action(() =>
    {
        startScraping(browser1); // or browser2 or browser3
    }));
}

I need to create the browser instances before scraping, because I do not want to start a new browser for every task.
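One way to meet that last requirement is to keep the pre-started browsers in a thread-safe pool and let each task borrow whichever one is idle. This is a minimal sketch, not from the question: `TBrowser` stands in for whatever browser automation type is actually used, and `BrowserPool` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Concurrent;

// Sketch: pre-create the browser instances once, then share them
// between tasks via a thread-safe blocking pool.
class BrowserPool<TBrowser>
{
    private readonly BlockingCollection<TBrowser> _pool =
        new BlockingCollection<TBrowser>();

    public BrowserPool(Func<TBrowser> factory, int count)
    {
        // Start all browsers up front, before any scraping begins.
        for (int i = 0; i < count; i++)
            _pool.Add(factory());
    }

    // Take a browser, run the work, and always return the browser,
    // even if the work throws.
    public void WithBrowser(Action<TBrowser> work)
    {
        TBrowser browser = _pool.Take();   // Blocks until a browser is free.
        try { work(browser); }
        finally { _pool.Add(browser); }    // Return it for the next task.
    }
}
```

Each enqueued Action then becomes `pool.WithBrowser(browser => startScraping(browser))`, so a task always gets whichever pre-started browser is currently idle instead of creating its own.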

A Petrov asked Oct 31 '22

2 Answers

Taking Henk Holterman's comment into account that you may want maximum speed, i.e. keep the browsers as busy as possible, use this:

private static void StartScraping(int id, IEnumerable<Uri> urls)
{
    // Construct browser here
    foreach (Uri url in urls)
    {
        // Use browser to process url here
        Console.WriteLine("Browser {0} is processing url {1}", id, url);
    }
}

and in Main:

    int nrWorkers = 3;
    int nrUrls = 10;
    BlockingCollection<Uri> taskQ = new BlockingCollection<Uri>();
    foreach (int i in Enumerable.Range(0, nrWorkers))
    {
        Task.Run(() => StartScraping(i, taskQ.GetConsumingEnumerable()));
    }
    foreach (int i in Enumerable.Range(0, nrUrls))
    {
        taskQ.Add(new Uri(String.Format("http://Url{0}", i)));
    }
    taskQ.CompleteAdding();
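One caveat with the snippet above: the Task.Run calls are fire-and-forget, so a console program may exit before the workers finish. A minimal variation (same StartScraping as above) keeps the Task handles and waits for them:

```csharp
int nrWorkers = 3;
int nrUrls = 10;
var taskQ = new BlockingCollection<Uri>();

// Keep the Task handles so we can wait for the workers to drain the queue.
Task[] workers = Enumerable.Range(0, nrWorkers)
    .Select(i => Task.Run(() => StartScraping(i, taskQ.GetConsumingEnumerable())))
    .ToArray();

foreach (int i in Enumerable.Range(0, nrUrls))
    taskQ.Add(new Uri(String.Format("http://Url{0}", i)));

taskQ.CompleteAdding();   // Each worker's enumerable ends once the queue drains.
Task.WaitAll(workers);    // Block until every URL has been processed.
```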
Emile answered Nov 15 '22


I think the usual approach is to have a single blocking queue, a provider thread and an arbitrary pool of workers.

The provider thread is responsible for adding URLs to the queue. It blocks when there are none to add.

A worker thread instantiates a browser, retrieves a single URL from the queue, scrapes it, and loops back for more. It blocks when the queue is empty.

You can start as many workers as you like, and they just sort it out between them.

The mainline starts all the threads and retires to the sidelines. It looks after the UI, if there is one.

Multithreading can be really hard to debug. You might want to look at using Tasks for at least part of the job.
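The structure described above can be sketched like this; it is only an outline, and the commented-out Browser line stands in for whatever browser type is actually used:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class Scraper
{
    static readonly BlockingCollection<Uri> Queue = new BlockingCollection<Uri>();

    // Worker: instantiate one browser, then loop pulling URLs until the
    // provider calls CompleteAdding and the queue drains.
    static void Worker(int id)
    {
        // var browser = new Browser();   // hypothetical browser type
        foreach (Uri url in Queue.GetConsumingEnumerable())
            Console.WriteLine("Worker {0} scraping {1}", id, url);
    }

    static void Main()
    {
        // Start as many workers as you like; they sort the URLs out
        // between themselves.
        Task[] workers = Enumerable.Range(0, 3)
            .Select(id => Task.Run(() => Worker(id)))
            .ToArray();

        // Provider: add URLs; idle workers pick them up immediately.
        foreach (string url in new[] { "http://url0", "http://url1", "http://url2" })
            Queue.Add(new Uri(url));

        Queue.CompleteAdding();
        Task.WaitAll(workers);   // The mainline retires to the sidelines.
    }
}
```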

david.pfx answered Nov 15 '22