
get out links from nutch

I am using nutch 1.3 to crawl a website. I want to get a list of urls crawled, and urls originating from a page.

I get list of urls crawled using readdb command.

bin/nutch readdb crawl/crawldb -dump file

Is there a way to find out urls that are on a page by reading crawldb or linkdb ?

in the org.apache.nutch.parse.html.HtmlParser I see outlinks array, I am wondering if there is a quick way to access it from command line.

asked Sep 15 '11 by surajz


2 Answers

From the command line, you can see the outlinks by using readseg with the -dump or -get option. For example:

bin/nutch readseg -dump crawl/segments/20110919084424/ outputdir2 -nocontent -nofetch -nogenerate -noparse -noparsetext

less outputdir2/dump
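Rather than paging through the whole dump, you can filter it: the outlinks appear in the ParseData records as `toUrl:` entries (the exact label is an assumption based on how Nutch 1.3 prints ParseData), so standard text tools can pull them out:

```shell
# Extract the distinct outlink targets from the segment dump;
# each outlink is printed by ParseData as "outlink: toUrl: <url> anchor: <text>"
grep -o 'toUrl: [^ ]*' outputdir2/dump | awk '{print $2}' | sort -u
```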
answered Sep 20 '22 by surajz


You can also do this with the readlinkdb command. It reads the link database (built by the invertlinks step), which records the inlinks pointing to each url.

bin/nutch readlinkdb <linkdb> (-dump <out_dir> | -url <url>)

linkdb: This is the linkdb directory we wish to read and obtain information from.

out_dir: This parameter dumps the whole linkdb to a text file in any out_dir we wish to specify.

url: The -url argument provides us with information about a specific url. This is written to System.out.

For example:

bin/nutch readlinkdb crawl/linkdb -dump myoutput/out1
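The dump is written as plain-text part files under the output directory, so ordinary text tools work on it. As a sketch (the record layout and part-file name below are assumptions based on the 1.3 Inlinks output and the usual Hadoop naming):

```shell
# Each record is a url followed by its recorded inlinks, roughly:
#   http://example.com/page  Inlinks:
#    fromUrl: http://other.com/ anchor: some text
# Count how many inlinks were recorded across the dump:
grep -c 'fromUrl:' myoutput/out1/part-00000
```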

For more information refer to http://wiki.apache.org/nutch/bin/nutch%20readlinkdb

answered Sep 17 '22 by Sriwantha Attanayake