When I run Nutch 1.10 with the following command (assuming that TestCrawl2 does not already exist and needs to be created):
sudo -E bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TestCrawlCore2 urls/ TestCrawl2/ 20
I receive an error on indexing that claims:
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/TestCrawl2/linkdb/current
The linkdb directory exists, but does not contain the 'current' directory. The directory is owned by root, so there should be no permissions issues. Because the process exited with an error, the linkdb directory contains .locked and ..locked.crc files. If I run the command again, these lock files cause it to exit in the same place. Delete the TestCrawl2 directory, rinse, repeat.
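For reference, the stale locks can presumably be removed directly so a re-run doesn't require throwing away the whole crawl directory (paths taken from the error above); this only unblocks the retry and doesn't address whatever leaves linkdb/current missing:

# remove the lock files left behind by the failed run
sudo rm /opt/apache-nutch-1.10/TestCrawl2/linkdb/.locked
sudo rm /opt/apache-nutch-1.10/TestCrawl2/linkdb/..locked.crc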
Note that the Nutch and Solr installations themselves have previously run without problems against a TestCrawl instance. It's only now that I'm trying a new one that I'm having problems. Any suggestions for troubleshooting this issue?
Ok, it seems as though I have run into a version of this problem:
https://issues.apache.org/jira/browse/NUTCH-2041
which is a result of the crawl script not being aware of changes to ignore_external_links in my nutch-site.xml file.
I am trying to crawl several sites and was hoping to keep my life simple by ignoring external links and leaving regex-urlfilter.txt alone (just using +.).
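For context, this is roughly the simple setup I had in mind, assuming ignore_external_links refers to the db.ignore.external.links property in nutch-site.xml and the filter file keeps only the catch-all rule:

nutch-site.xml:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>

regex-urlfilter.txt:
# accept anything
+.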
Now it looks like I'll have to change ignore_external_links back to false and add a regex filter for each of my URLs. Hopefully a Nutch 1.11 release lands soon; it looks like this is fixed there.
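Roughly what I expect that workaround to look like (a sketch only; example.com and example.org are placeholders for the sites I'm actually crawling):

nutch-site.xml:
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>

regex-urlfilter.txt:
# accept only the sites being crawled (one rule per site), drop everything else
+^https?://([a-z0-9]*\.)*example\.com/
+^https?://([a-z0-9]*\.)*example\.org/
-.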