
How to run different Apache Nutch jobs in parallel

I am using Nutch 2.3. All jobs run one after another: first the generator, then fetch, parse, index, and so on. I want to run some jobs simultaneously. I know some jobs cannot run in parallel, but others can, e.g. the parse, dbupdate, and index jobs should be able to run alongside fetch.

Is this possible? My basic objective is to keep the fetcher job running all the time. I suppose it can be done with different timestamps. Can anyone point me to the proper way?

Hafiz Muhammad Shafiq asked May 05 '15 06:05



1 Answer

If you look at the Nutch web application server, you will find that it can execute multiple crawl jobs in parallel. Check the Nutch 2.3 source code for the webapp (NutchUiServer). Hope this helps.
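
For the idea in the question of keeping the fetcher busy by giving each round its own ID, here is a rough Python sketch, not a tested recipe. It assumes each batch was already generated with its own batch ID, and that `fetch`, `parse` and `updatedb` accept a batch ID plus `-crawlId` the way the bundled `bin/crawl` script appears to call them; please verify that against the usage output of `bin/nutch` for your version. The `NUTCH` path, the function names and the `runner` parameter are made up for illustration. Whether overlapping jobs is safe also depends on your storage backend.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

NUTCH = "bin/nutch"  # assumed location of the Nutch launcher script


def fetch_cmd(batch_id, crawl_id):
    # Assumed argument layout; check it against your Nutch version.
    return [NUTCH, "fetch", batch_id, "-crawlId", crawl_id]


def post_fetch_cmds(batch_id, crawl_id):
    # Steps that only touch the batch that was just fetched.
    return [
        [NUTCH, "parse", batch_id, "-crawlId", crawl_id],
        [NUTCH, "updatedb", batch_id, "-crawlId", crawl_id],
    ]


def run_all(cmds, runner):
    # Run the given commands one after another.
    for cmd in cmds:
        runner(cmd)


def pipeline(batch_ids, crawl_id, runner=None):
    """Fetch batch N+1 while batch N is being parsed and written back."""
    if runner is None:
        # Default: really launch the command and fail on a non-zero exit.
        runner = lambda cmd: subprocess.run(cmd, check=True)
    futures = []
    # One background worker keeps the parse/updatedb steps in batch order.
    with ThreadPoolExecutor(max_workers=1) as pool:
        for batch_id in batch_ids:
            # The fetcher runs in the foreground, one batch after another.
            runner(fetch_cmd(batch_id, crawl_id))
            # Post-processing is handed off so the next fetch can start.
            futures.append(
                pool.submit(run_all, post_fetch_cmds(batch_id, crawl_id), runner)
            )
    # Re-raise any error that happened in the background steps.
    for future in futures:
        future.result()
```

You would call it as `pipeline(["batch-1", "batch-2"], "mycrawl")`, where the batch IDs are the ones you passed to the generator. The only design choice here is that fetching stays in the foreground and everything after it is pushed to a single background worker, so the fetcher is idle only while a new batch is being handed over.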

Mubin Shrestha answered Nov 15 '22 17:11