I am trying to set up Apache Nutch to crawl URLs, following this guide. The guide is older (it targets 1.x, and I am using 2.3), so I have made the necessary changes to the directory structure. However, when I try to run a crawl, I get this error:
root@IndiStage:~# /usr/local/nutch/framework/apache-nutch-2.3/src/bin/crawl urls FirstCrawl 2
No SOLRURL specified. Skipping indexing.
Injecting seed URLs
/usr/local/nutch/framework/apache-nutch-2.3/src/bin/nutch inject urls -crawlId FirstCrawl
Error: Could not find or load main class org.apache.nutch.crawl.InjectorJob
Error running:
/usr/local/nutch/framework/apache-nutch-2.3/src/bin/nutch inject urls -crawlId FirstCrawl
Failed with exit value 1.
root@IndiStage:~#
Being new to Ubuntu (14.04), I am finding it hard to manage the directory structure and paths here.
InjectorJob is in /usr/local/nutch/framework/apache-nutch-2.3/src/java/org/apache/nutch/crawl
JAVA_HOME is set to /usr/lib/jvm/java-7-openjdk-amd64
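Make sure that you have already compiled the Nutch source code (running ant runtime in ${APACHE_NUTCH_HOME} creates the runtime directory). The scripts under src/bin run against the source tree, which holds the .java sources but not the compiled classes, and that is most likely why the main class cannot be found. Then run the crawl command from ${APACHE_NUTCH_HOME}/runtime/local (or ${APACHE_NUTCH_HOME}/runtime/deploy/bin).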
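As a rough sketch of the steps above: the script below checks whether a compiled runtime exists before starting the crawl. The helper name nutch_runtime_dir is made up for this illustration, and the default NUTCH_HOME is simply the path from the question, so adjust both to your setup. It assumes the build produces runtime/local/bin/crawl.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print the compiled runtime
# directory, or fail if Nutch has not been built yet.
nutch_runtime_dir() {
    home="$1"
    if [ -x "$home/runtime/local/bin/crawl" ]; then
        printf '%s\n' "$home/runtime/local"
    else
        echo "No compiled runtime under $home; run 'ant runtime' there first." >&2
        return 1
    fi
}

# Path taken from the question; override by exporting NUTCH_HOME.
NUTCH_HOME="${NUTCH_HOME:-/usr/local/nutch/framework/apache-nutch-2.3}"

# Same arguments as in the question (seed dir, crawl id, rounds), but using
# the script from the compiled runtime instead of the one under src/bin.
if runtime_dir=$(nutch_runtime_dir "$NUTCH_HOME"); then
    "$runtime_dir/bin/crawl" urls FirstCrawl 2
fi
```

If the check fails, build first with cd "$NUTCH_HOME" && ant runtime, then run the script again. The seed directory urls is resolved relative to the directory you start the script from, as in the original command.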
Hope this helps,
Le Quoc Do