I am using Nutch 1.3 to crawl a website. I want to get the list of URLs crawled, as well as the URLs found on each page.
I get the list of crawled URLs using the readdb command:
bin/nutch readdb crawl/crawldb -dump file
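The dump comes out as plain text in the output directory (a part-00000 file written by the underlying MapReduce job). If all you want is a bare list of URLs, you can filter the record-header lines; this is a rough sketch that assumes the crawldb dump's tab-separated "URL<tab>Version: ..." header layout, so verify it against your own dump:
# keep only the URL column of each record header (layout is an assumption)
grep '^http' file/part-00000 | cut -f1 | sort -u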
Is there a way to find out the URLs that are on a page by reading the crawldb or linkdb?
In org.apache.nutch.parse.html.HtmlParser I see an outlinks array; I am wondering if there is a quick way to access it from the command line.
From the command line, you can see the outlinks by using readseg with the -dump or -get option. For example:
bin/nutch readseg -dump crawl/segments/20110919084424/ outputdir2 -nocontent -nofetch -nogenerate -noparse -noparsetext
less outputdir2/dump
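If you only care about the outlinks, each parsed record in the dump lists them one per line, so a grep can pull them out. This assumes the "toUrl:" label Nutch uses when printing an outlink; check it against your dump first:
# extract the outlink lines from the segment dump (label is an assumption)
grep 'toUrl:' outputdir2/dump | sort -u
Alternatively, the -get form prints the data for a single URL instead of dumping the whole segment (http://example.com/ is just a placeholder):
bin/nutch readseg -get crawl/segments/20110919084424/ http://example.com/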
You can do this easily with the readlinkdb command. It gives you the inlinks and outlinks to and from a URL.
bin/nutch readlinkdb <linkdb> (-dump <out_dir> | -url <url>)
linkdb: the linkdb directory we wish to read and obtain information from.
out_dir: the -dump option dumps the whole linkdb to a text file in any out_dir we specify.
url: the -url argument provides information about a specific URL. This is written to System.out.
e.g.
bin/nutch readlinkdb crawl/linkdb -dump myoutput/out1
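To inspect a single page instead of dumping everything, use the -url form; it prints that URL's entry to the console (http://example.com/ is just a placeholder):
bin/nutch readlinkdb crawl/linkdb -url http://example.com/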
For more information, refer to http://wiki.apache.org/nutch/bin/nutch%20readlinkdb