What is the maximum number of Apache Nutch crawler instances that can run at the same time with one master node?
Nutch takes the injected URLs, stores them in the CrawlDB, and uses those entries to go out to the web and fetch each URL. It then parses the fetched data into various fields and pushes any discovered hyperlinks back into the CrawlDB.
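The inject/fetch/parse/update cycle described above maps onto Nutch's individual sub-commands. A minimal sketch of one round, assuming a local Nutch install; the directory names and the segment timestamp are illustrative, and exact options vary between Nutch versions (`DRY_RUN=echo` just prints the commands instead of running them):

```shell
#!/bin/sh
# One crawl round, spelled out step by step (paths are assumptions).
DRY_RUN=echo   # set to "" to actually run (requires a Nutch installation)

$DRY_RUN bin/nutch inject crawl/crawldb urls                # seed the CrawlDB with injected URLs
$DRY_RUN bin/nutch generate crawl/crawldb crawl/segments    # select URLs due for fetching
SEGMENT=crawl/segments/20240101000000                       # newest segment dir (assumed name)
$DRY_RUN bin/nutch fetch "$SEGMENT"                         # go out to the web and fetch each URL
$DRY_RUN bin/nutch parse "$SEGMENT"                         # parse fetched data into fields and outlinks
$DRY_RUN bin/nutch updatedb crawl/crawldb "$SEGMENT"        # push discovered hyperlinks back into the CrawlDB
```

In practice the bundled `bin/crawl` script wraps exactly this loop and repeats it for a given number of rounds.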
Apache Nutch is a highly extensible and scalable open source web crawler software project.
Index writers in Nutch: an index writer is a component of the indexing job that sends documents from one or more segments to an external server. In Nutch, these components are provided as plugins, and several index writers ship out of the box.
It's not clear what you mean by "crawler instances." If you want to run the crawl script several times in parallel (e.g., you have distinct crawls with separate configs, seeds, etc.), then they will compete for slots on the Hadoop cluster. The answer then boils down to how many mapper/reducer slots are available on your cluster, which in turn depends on how many slave nodes it has.
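Running several independent crawls in parallel could look like the sketch below. Everything here is illustrative: the config directories, seed paths, and crawl names are assumptions, and `bin/crawl` flags differ between Nutch versions; `NUTCH_CONF_DIR` is how the Nutch launcher scripts locate a configuration directory. `DRY_RUN=echo` prints the commands instead of launching them:

```shell
#!/bin/sh
# Hypothetical sketch: two logical crawls, each with its own config dir and
# seed list, launched in parallel. Both jobs compete for the same Hadoop
# mapper/reducer slots, so throughput depends on cluster capacity.
DRY_RUN=echo   # set to "" to really launch (requires a Nutch installation)

NUTCH_CONF_DIR=conf/crawlA $DRY_RUN bin/crawl -s seeds/crawlA crawlA 2 &
NUTCH_CONF_DIR=conf/crawlB $DRY_RUN bin/crawl -s seeds/crawlB crawlB 2 &
wait   # block until both background crawls have finished
```

This illustrates why parallel crawls get resource-inefficient: each invocation submits its own MapReduce jobs, and with a fixed number of slots the crawls simply queue behind one another.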
Handling multiple Nutch crawls in parallel can get very tricky and resource-inefficient. Instead, rethink your architecture so that all the logical crawlers run as a single physical one, or have a look at StormCrawler, which should be a better fit for this use case.