I need to crawl a set of parent web pages and their child pages, and I followed the producer/consumer pattern from http://www.albahari.com/threading/part4.aspx#%5FWait%5Fand%5FPulse. I use 5 threads, each of which both enqueues and dequeues links.
Any recommendations on how to end/join all the threads once they have finished processing the queue, given that the length of the queue is not known in advance?
Below is an outline of how I coded it.
static void Main(string[] args)
{
    // enqueue parent links here
    ...
    // then start crawling via threading
    ...
}

public void Crawl()
{
    // dequeue
    // get child links
    // enqueue child links
}
If all of your threads are idle (i.e. waiting on the queue) and the queue is empty, then you're done.
An easy way to handle that is to have the threads use a timeout when they try to access the queue, with something like BlockingCollection.TryTake. Whenever TryTake times out, the thread updates a field to say how long it has been idle:
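while (!queue.TryTake(out item, 5000, token))
{
    // TryTake throws OperationCanceledException if the token is cancelled
    // while it waits, so this check only covers the gap between calls.
    if (token.IsCancellationRequested)
        break;
    // here, update idle counter
}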
You can then have a timer that executes every 15 seconds or so to check all of the threads' idle counters. If all threads have been idle for some period of time (a minute, perhaps), the timer can cancel the token, which makes every worker leave its loop and exit. Your main program can monitor the cancellation token as well.
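The idle-counter idea can be sketched roughly as follows. This is only an illustration under assumptions: the class name IdleTimeoutCrawler, the getChildLinks delegate (a stand-in for the real fetch-and-parse step), and the timeout values are all made up for the example, and a polling loop on the calling thread plays the role of the timer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class IdleTimeoutCrawler
{
    // getChildLinks is a hypothetical stand-in for downloading a page and extracting its links.
    // Returns the number of links processed.
    public static int Run(IEnumerable<string> parents,
                          Func<string, IEnumerable<string>> getChildLinks,
                          int workerCount = 5,
                          int takeTimeoutMs = 200,
                          int idleLimitMs = 1000)
    {
        var queue = new BlockingCollection<string>();
        var cts = new CancellationTokenSource();
        var idleTicks = new int[workerCount]; // consecutive TryTake timeouts per worker
        int ticksNeeded = Math.Max(1, idleLimitMs / takeTimeoutMs);
        int processed = 0;

        foreach (var p in parents) queue.Add(p);

        var workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            int id = i; // capture a per-thread copy of the loop variable
            workers[i] = new Thread(() =>
            {
                try
                {
                    while (true)
                    {
                        // False means the timeout elapsed; cancellation throws instead.
                        if (queue.TryTake(out string link, takeTimeoutMs, cts.Token))
                        {
                            Volatile.Write(ref idleTicks[id], 0); // busy again
                            foreach (var child in getChildLinks(link)) queue.Add(child);
                            Interlocked.Increment(ref processed);
                        }
                        else
                        {
                            Interlocked.Increment(ref idleTicks[id]); // one more idle period
                        }
                    }
                }
                catch (OperationCanceledException)
                {
                    // Normal shutdown path once the monitor cancels the token.
                }
            });
            workers[i].Start();
        }

        // Monitor loop (in place of a timer): cancel when every worker has been
        // idle long enough and nothing is left in the queue.
        while (true)
        {
            Thread.Sleep(takeTimeoutMs);
            bool allIdle = queue.Count == 0;
            for (int i = 0; allIdle && i < workerCount; i++)
                allIdle = Volatile.Read(ref idleTicks[i]) >= ticksNeeded;
            if (allIdle) { cts.Cancel(); break; }
        }

        foreach (var w in workers) w.Join();
        return processed;
    }
}
```

Something like IdleTimeoutCrawler.Run(parentLinks, FetchChildLinks) would then block until the crawl has gone quiet. Note that this is a heuristic: a worker that has just taken an item but has not yet reset its counter could in principle be caught by the monitor, so the idle limit should be generous compared with the take timeout.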
You can do this without BlockingCollection and cancellation, by the way. You'll just have to create your own cancellation signaling mechanism, and if you're using a lock on the queue, you can replace the lock syntax with Monitor.TryEnter, etc.
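As a rough sketch of that variant, assuming a plain Queue<string> guarded by a single lock object: the TimedQueue class, its member names, and the volatile stop flag below are hypothetical, not part of any library. Monitor.TryEnter bounds the wait for the lock, and Monitor.Wait/Pulse (as in the linked article) bounds the wait for an item.

```csharp
using System.Collections.Generic;
using System.Threading;

public sealed class TimedQueue
{
    private readonly Queue<string> _items = new Queue<string>();
    private readonly object _gate = new object();
    private volatile bool _stopRequested; // hand-rolled replacement for a CancellationToken

    public void RequestStop() { _stopRequested = true; }
    public bool StopRequested { get { return _stopRequested; } }

    public void Enqueue(string item)
    {
        lock (_gate)
        {
            _items.Enqueue(item);
            Monitor.Pulse(_gate); // wake one thread waiting in TryDequeue
        }
    }

    // Returns false if a stop was requested, the lock could not be acquired
    // within timeoutMs, or no item arrived within a further timeoutMs.
    public bool TryDequeue(out string item, int timeoutMs)
    {
        item = null;
        if (_stopRequested) return false;

        bool lockTaken = false;
        try
        {
            Monitor.TryEnter(_gate, timeoutMs, ref lockTaken);
            if (!lockTaken) return false;

            // Empty queue: release the lock and wait for a Pulse or the timeout.
            if (_items.Count == 0 && !Monitor.Wait(_gate, timeoutMs)) return false;
            if (_items.Count == 0) return false; // another thread got there first

            item = _items.Dequeue();
            return true;
        }
        finally
        {
            if (lockTaken) Monitor.Exit(_gate);
        }
    }
}
```

A worker loop would call TryDequeue repeatedly, bump its idle counter whenever it returns false, and exit once StopRequested is true.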
There are several other ways to handle this, although they would require some major restructuring of your program.
You can enqueue a dummy token at the end and have the threads exit when they encounter this token. Like:
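// Shared by all crawler threads, so it has to be a field rather than a local.
static int report = 0;

public void Crawl()
{
    bool reported = false; // true while this thread is counted in report
    while (true)
    {
        if (queue.Count != 0)
        {
            if (reported) { Interlocked.Decrement(ref report); reported = false; }
            // dequeue into token
            if (token == "TERMINATION")
            {
                queue.Enqueue(token); // put the marker back so the other threads see it too
                return;
            }
            // get child links and enqueue them
        }
        else
        {
            // this thread has found the queue empty; count it only once
            if (!reported) { Interlocked.Increment(ref report); reported = true; }
            if (report == num_threads) // all threads have signaled empty queue
                queue.Enqueue("TERMINATION");
        }
    }
}
Of course, I have omitted the locks for the enqueue/dequeue operations. The emptiness check, the dequeue, and the counter update need to happen under the same lock for this to be reliable.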
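For reference, here is one way the termination-marker idea might look with the locking filled in. It is a sketch under assumptions rather than a drop-in implementation: MarkerCrawler, getChildLinks, and the 10 ms back-off are invented for the example, and the marker is peeked rather than dequeued so that every thread gets to see it.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class MarkerCrawler
{
    private const string Termination = "TERMINATION";

    // getChildLinks is a hypothetical stand-in for the real fetch-and-parse step.
    // Returns the number of links processed.
    public static int Run(IEnumerable<string> parents,
                          Func<string, IEnumerable<string>> getChildLinks,
                          int numThreads = 5)
    {
        var queue = new Queue<string>(parents);
        var gate = new object();
        int idleCount = 0; // threads that currently see an empty queue; guarded by gate
        int processed = 0; // guarded by gate

        ThreadStart crawl = () =>
        {
            bool idle = false; // is this thread currently counted in idleCount?
            while (true)
            {
                string link = null;
                lock (gate)
                {
                    if (queue.Count > 0)
                    {
                        // Leave the marker in place so the remaining threads also see it.
                        if (queue.Peek() == Termination) return;
                        if (idle) { idleCount--; idle = false; }
                        link = queue.Dequeue();
                    }
                    else
                    {
                        if (!idle) { idleCount++; idle = true; }
                        // Every thread is idle, so nobody can add more work: signal the end.
                        if (idleCount == numThreads) queue.Enqueue(Termination);
                    }
                }

                if (link == null)
                {
                    Thread.Sleep(10); // back off briefly instead of spinning
                    continue;
                }

                // Do the slow work outside the lock, then publish the results.
                var children = new List<string>(getChildLinks(link));
                lock (gate)
                {
                    foreach (var child in children) queue.Enqueue(child);
                    processed++;
                }
            }
        };

        var threads = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++)
        {
            threads[i] = new Thread(crawl);
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        return processed;
    }
}
```

A thread that is busy fetching a page is never counted as idle, so the marker can only be enqueued when no thread holds work that might still produce child links.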