
Maximum number of Apache Nutch worker instances

Tags: hadoop, nutch

What is the maximum number of Apache Nutch crawler instances that can run at the same time with one master node?

asked Dec 17 '15 by Sanaz Marshall


People also ask

How does Apache Nutch work?

Nutch takes the injected URLs, stores them in the CrawlDB, and uses those links to go out to the web and scrape each URL. Then, it parses the scraped data into various fields and pushes any scraped hyperlinks back into the CrawlDB.
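In Nutch 1.x this cycle maps onto a handful of commands. A minimal sketch of one round (directory names such as urls/ and crawl/ are placeholders) might look like:

    # seed the CrawlDB with the injected URLs
    bin/nutch inject crawl/crawldb urls
    # select URLs due for fetching into a new segment
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    SEGMENT=$(ls -d crawl/segments/* | tail -1)
    # fetch and parse the pages in that segment
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    # push scraped hyperlinks and fetch status back into the CrawlDB
    bin/nutch updatedb crawl/crawldb $SEGMENT

The bin/crawl script essentially loops over these steps for a given number of rounds.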

Is Apache Nutch open source?

Apache Nutch is a highly extensible and scalable open source web crawler software project.

What is nutch indexing?

An index writer is a component of the indexing job which sends documents from one or more segments to an external server. In Nutch these components are implemented as plugins, and several index writers ship out of the box.
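As a rough sketch, with an index writer plugin such as indexer-solr enabled via plugin.includes in conf/nutch-site.xml, a Nutch 1.x indexing run looks something like the following (the exact arguments vary between versions):

    # send documents from the parsed segments to the configured external server
    bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/*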


1 Answer

It's not clear what you mean by crawler instances. If you want to run the crawl script several times in parallel, e.g. because you have distinct crawls with separate configs, seeds, etc., then they will compete for slots on the Hadoop cluster. It then boils down to how many mapper/reducer slots are available on your cluster, which in turn depends on how many slave nodes there are.
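For illustration, two independent crawls launched in parallel might look roughly like this (directory names, conf dirs and the exact bin/crawl arguments are placeholders and differ between Nutch versions); both submit MapReduce jobs to the same cluster and therefore compete for its slots:

    # hypothetical parallel crawls, each with its own config dir, seed list and crawl dir
    NUTCH_CONF_DIR=/opt/nutch/conf-crawlA bin/crawl seeds/crawlA crawlA 2 &
    NUTCH_CONF_DIR=/opt/nutch/conf-crawlB bin/crawl seeds/crawlB crawlB 2 &
    wait

With classic MapReduce the per-slave slot counts come from mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml; under YARN the equivalent limit is the memory and vcores available for containers on each node.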

Handling multiple Nutch crawls in parallel can get very tricky and resource-inefficient. Instead, rethink your architecture so that all the logical crawlers run as a single physical one, or have a look at StormCrawler, which should be a better fit for this kind of setup.

answered Oct 14 '22 by Julien Nioche