 

nutch 1.10 input path does not exist /linkdb/current

Tags: solr, hadoop, nutch

When I run Nutch 1.10 with the following command (assuming that TestCrawl2 did not previously exist and needs to be created):

sudo -E bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TestCrawlCore2 urls/ TestCrawl2/ 20

I receive an error during the indexing step:

Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/TestCrawl2/linkdb/current

The linkdb directory exists, but does not contain the 'current' directory. The directory is owned by root, so there should be no permission issues. Because the process exited on an error, the linkdb directory contains .locked and ..locked.crc files. If I run the command again, these lock files cause it to exit in the same place. Delete the TestCrawl2 directory, rinse, repeat.
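To re-run without deleting the whole TestCrawl2 directory, the stale locks can be cleared first (a minimal sketch, using the path from the error above; this does not by itself fix the missing linkdb/current):

# remove the stale Hadoop lock files left behind by the failed run
sudo rm /opt/apache-nutch-1.10/TestCrawl2/linkdb/.locked
sudo rm /opt/apache-nutch-1.10/TestCrawl2/linkdb/..locked.crc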

Note that the Nutch and Solr installations themselves have run previously without problems in a TestCrawl instance; it's only now that I'm trying a new one that I'm having problems. Any suggestions on troubleshooting this issue?

asked Nov 03 '15 by Anonymous Man


1 Answer

Ok, it seems as though I have run into a version of this problem:

https://issues.apache.org/jira/browse/NUTCH-2041

This is a result of the crawl script not being aware of the ignore_external_links change in my nutch-site.xml file.

I am trying to crawl several sites and was hoping to keep my life simple by ignoring external links and leaving regex-urlfilter.txt alone (just using +.).

Now it looks like I'll have to change ignore_external_links back to false and add a regex filter for each of my URLs, as sketched below. Hopefully a Nutch 1.11 release will come out soon; it looks like this is fixed there.
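For reference, this is the direction I plan to take (a minimal sketch; db.ignore.external.links is the nutch-site.xml property I believe ignore_external_links refers to, and example.com / example.org stand in for my actual sites):

In nutch-site.xml:

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>Set back to false as a workaround for NUTCH-2041 with the 1.10 crawl script.</description>
</property>

In regex-urlfilter.txt, replacing the catch-all +. with one rule per site:

# accept only URLs on my target hosts
+^https?://([a-z0-9-]+\.)*example\.com/
+^https?://([a-z0-9-]+\.)*example\.org/
# URLs matching no rule are rejected by the regex filter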

answered Sep 22 '22 by Anonymous Man