
Fastest architecture for multithreaded web crawler

There should be a frontier object holding the set of visited URLs and the set of URLs waiting to be crawled. There should be threads responsible for crawling web pages, and some kind of controller object to create the crawling threads.

I don't know which architecture would be faster and easier to extend. How should I divide responsibilities so that as little synchronization as possible is needed, and so that the number of checks for whether the current URL has already been visited is minimized?

Should the controller object be responsible for providing new URLs to the worker threads? That would mean the worker threads crawl all the URLs they were given and then sleep for an indefinite time, with the controller interrupting them when new work arrives, so each crawling thread would have to handle InterruptedException (how expensive is that in Java? exception handling does not seem to be very fast). Or should the controller only start the threads and let the crawling threads fetch URLs from the frontier themselves?
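The second option (threads pulling from the frontier themselves) can avoid both the sleep/interrupt dance and a separate visited-check, because a concurrent set can do the "seen before?" test and the insert in one atomic step. A minimal sketch, assuming a hypothetical `Frontier` class (the names here are illustrative, not from any library):

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: the controller only starts the threads; each crawler pulls
// URLs from this shared frontier itself.
class Frontier {
    // The visited set doubles as the de-duplication check.
    // ConcurrentHashMap.newKeySet() gives an atomic test-and-add,
    // so no extra locking is needed around it.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Enqueues the URL and returns true only the first time it is seen.
    boolean offer(String url) {
        if (visited.add(url)) {   // atomic "not seen before" check
            queue.add(url);
            return true;
        }
        return false;             // duplicate: dropped without blocking
    }

    // Crawler threads block here instead of sleeping and being
    // interrupted by a controller.
    String take() throws InterruptedException {
        return queue.take();
    }
}
```

With this shape, InterruptedException only matters at shutdown (when the controller cancels the blocked `take()` calls), not in the steady-state crawl loop.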

Damian asked Dec 06 '25 03:12

2 Answers

Create a shared, thread-safe list containing the URLs to be crawled. Create an Executor with the number of threads corresponding to the number of crawlers you want to run concurrently. Start your crawlers as Runnables with a reference to the shared list and submit each of them to the Executor. Each crawler removes the next URL from the list, does whatever you need it to do, and loops until the list is empty.
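The steps above can be sketched as follows; the actual fetching and link extraction are stubbed out, since they are outside the scope of the answer:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the answer: a shared thread-safe queue of URLs plus a
// fixed-size Executor; each Runnable loops until the queue is drained.
class CrawlerPool {
    static void crawlAll(Queue<String> urls, int nThreads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < nThreads; i++) {
            pool.submit(() -> {
                String url;
                // poll() is atomic, so no extra synchronization is needed
                // around taking the next URL.
                while ((url = urls.poll()) != null) {
                    System.out.println(Thread.currentThread().getName()
                            + " crawling " + url);
                    // fetch(url) and enqueue newly discovered links here
                }
            });
        }
        pool.shutdown();                        // accept no new tasks
        pool.awaitTermination(1, TimeUnit.MINUTES); // wait for crawlers to finish
    }
}
```

Usage: `CrawlerPool.crawlAll(new ConcurrentLinkedQueue<>(seedUrls), 4);` runs four concurrent crawlers over the seed list and returns once the queue is empty.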

jtahlborn answered Dec 08 '25 16:12


It's been a few years since this question was asked, but as of November 2015 we are currently using Frontera and Scrapyd.

Scrapy uses Twisted, which makes it a good multithreaded crawler, and on multi-core machines that means we are only limited by inbound bandwidth. Frontera-distributed uses HBase and Kafka to score links and keep all the data accessible to clients.

Sam Texas answered Dec 08 '25 15:12

