
Producer/consumer of a web crawler using queue with unknown size

I need to crawl parent web pages and their child pages, and I followed the producer/consumer pattern from http://www.albahari.com/threading/part4.aspx#%5FWait%5Fand%5FPulse. I use 5 threads, each of which both enqueues and dequeues links.

Any recommendations on how to end/join all the threads once they have finished processing the queue, given that the length of the queue is unknown?

Below is an outline of how I coded it.

static void Main(string[] args)
{
    //enqueue parent links here
    ...
    //then start crawling via threading
    ...
}

public void Crawl()
{
   //dequeue
   //get child links
   //enqueue child links
}
user611333 asked Oct 10 '22 06:10



2 Answers

If all of your threads are idle (i.e. waiting on the queue) and the queue is empty, then you're done.

An easy way to handle that is to have the threads use a timeout when they're trying to access the queue. Something like BlockingCollection.TryTake. Whenever TryTake times out, the thread updates a field to say how long it's been idle:

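try
{
    // TryTake throws OperationCanceledException if the token is cancelled while it waits,
    // so cancellation is handled in the catch block rather than by polling the token.
    while (!queue.TryTake(out item, 5000, token))
    {
        // timed out after 5 seconds: update this thread's idle counter here
    }
    // item was taken: reset the idle counter and process it
}
catch (OperationCanceledException)
{
    // cancellation requested: let the thread exit
}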

You can then have a timer that executes every 15 seconds or so to check all of the threads' idle counters. If all threads have been idle for some period of time (a minute, perhaps), the timer can cancel the token, which makes every thread stop waiting on the queue and exit. Your main program can monitor the cancellation token too.
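The idle-counter scheme described above can be sketched roughly as follows. This is only an illustration under some assumptions: `GetChildLinks` is a hypothetical stub standing in for the real page download, the class and parameter names are made up, and the main thread polls the counters itself instead of using a separate timer. The idle limit is a heuristic, so it should be comfortably longer than the time a page can take to arrive.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class IdleTimeoutCrawl
{
    // Hypothetical stand-in for fetching a page and extracting its links:
    // every link has two children until it is two levels deep.
    static IEnumerable<string> GetChildLinks(string url) =>
        url.Count(c => c == '/') < 2 ? new[] { url + "/a", url + "/b" } : new string[0];

    // True when every worker has been idle for at least limitMs.
    static bool AllIdle(int[] idleMs, int limitMs)
    {
        for (int i = 0; i < idleMs.Length; i++)
            if (Volatile.Read(ref idleMs[i]) < limitMs) return false;
        return true;
    }

    public static int Run(IEnumerable<string> parents, int workers, int pollMs, int idleLimitMs)
    {
        var queue = new BlockingCollection<string>();
        foreach (var p in parents) queue.Add(p);

        var cts = new CancellationTokenSource();
        var idleMs = new int[workers];   // per-thread idle time, reset when work arrives
        int processed = 0;

        var threads = Enumerable.Range(0, workers).Select(id => new Thread(() =>
        {
            try
            {
                while (true)
                {
                    string url;
                    // TryTake throws OperationCanceledException once the token is cancelled.
                    while (!queue.TryTake(out url, pollMs, cts.Token))
                        Interlocked.Add(ref idleMs[id], pollMs);   // timed out: still idle

                    Interlocked.Exchange(ref idleMs[id], 0);       // got work: no longer idle
                    foreach (var child in GetChildLinks(url)) queue.Add(child);
                    Interlocked.Increment(ref processed);
                }
            }
            catch (OperationCanceledException) { /* shutdown requested */ }
        })).ToList();
        threads.ForEach(t => t.Start());

        // The main thread plays the role of the periodic timer here.
        while (!AllIdle(idleMs, idleLimitMs)) Thread.Sleep(pollMs);

        cts.Cancel();                      // wakes every thread blocked in TryTake
        threads.ForEach(t => t.Join());
        cts.Dispose();
        return processed;
    }
}
```

A call such as `IdleTimeoutCrawl.Run(new[] { "p", "q" }, 5, 20, 100)` starts 5 workers, polls every 20 ms, and shuts down once all of them have been idle for 100 ms; a real crawler would use much longer intervals, like the ones suggested above.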

You can do this without BlockingCollection and cancellation, by the way. You'll just have to create your own cancellation signaling mechanism, and if you're using a lock on the queue, you can replace the lock syntax with Monitor.TryEnter, etc.

There are several other ways to handle this, although they would require some major restructuring of your program.
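One such restructuring, offered as a sketch and not necessarily what this answer had in mind, is to count outstanding links. Each link is counted when it is enqueued and uncounted once it has been fully processed, so a count of zero means the queue is empty and no thread can add more work. At that point `CompleteAdding` lets every consumer loop finish without timeouts. `GetChildLinks` and the class name are hypothetical, as before.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class PendingCountCrawl
{
    // Hypothetical stand-in for fetching a page and extracting its links:
    // every link has two children until it is two levels deep.
    static IEnumerable<string> GetChildLinks(string url) =>
        url.Count(c => c == '/') < 2 ? new[] { url + "/a", url + "/b" } : new string[0];

    public static int Run(IEnumerable<string> parents, int workers)
    {
        var queue = new BlockingCollection<string>();
        int pending = 0;     // links enqueued but not yet fully processed
        int processed = 0;

        foreach (var p in parents)
        {
            Interlocked.Increment(ref pending);
            queue.Add(p);
        }
        if (pending == 0) queue.CompleteAdding();   // nothing to crawl at all

        var threads = Enumerable.Range(0, workers).Select(_ => new Thread(() =>
        {
            // GetConsumingEnumerable blocks while the queue is empty and
            // ends once CompleteAdding has been called and the queue is drained.
            foreach (var url in queue.GetConsumingEnumerable())
            {
                foreach (var child in GetChildLinks(url))
                {
                    Interlocked.Increment(ref pending);   // count the child before enqueueing it
                    queue.Add(child);
                }
                Interlocked.Increment(ref processed);

                // This link is done. If nothing else is pending, no thread is
                // working and the queue is empty, so the crawl is finished.
                if (Interlocked.Decrement(ref pending) == 0)
                    queue.CompleteAdding();
            }
        })).ToList();

        threads.ForEach(t => t.Start());
        threads.ForEach(t => t.Join());
        return processed;
    }
}
```

Because a link's children are counted before the link itself is uncounted, the counter cannot reach zero while work is still in flight, which is what makes the final `Join` safe.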

Jim Mischel answered Oct 11 '22 23:10



You can enqueue a dummy (sentinel) token once all the work is done and have the threads exit when they encounter it. For example:

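static readonly object queueLock = new object();
static int idleThreads = 0;   // shared by all workers, guarded by queueLock

public void Crawl()
{
   bool idle = false;   // true while this thread is counted in idleThreads
   while(true)
   {
       string link;
       lock(queueLock)   // queue is assumed to be a Queue<string>
       {
          if(queue.Count != 0)
          {
             if(queue.Peek() == "TERMINATION")
                return;   // leave the token in place so the other threads see it too
             link = queue.Dequeue();
             if(idle) { idleThreads--; idle = false; }   // this thread is busy again
          }
          else
          {
             if(!idle) { idleThreads++; idle = true; }   // this thread has found the queue empty
             if(idleThreads == num_threads)   // all threads have signaled empty queue
                queue.Enqueue("TERMINATION");
             continue;
          }
       }
       //get child links for link, then enqueue them inside lock(queueLock)
   }
}

The idle count has to be shared between the threads, and the lock has to cover both the queue and the count; otherwise the token could be enqueued while another thread is still about to add child links. Idle threads spin in this version; Monitor.Wait and Monitor.Pulse would let them sleep instead.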

Tudor answered Oct 12 '22 01:10
