Web Crawler Algorithm: depth?

I'm working on a crawler and need to understand exactly what is meant by "link depth". Take Nutch, for example: http://wiki.apache.org/nutch/NutchTutorial

depth indicates the link depth from the root page that should be crawled.

So, say I have the domain www.domain.com and want to crawl to a depth of, say, 3 -- what do I need to do? If a site could be represented as a binary tree, then it wouldn't be a problem, I think.

asked Dec 06 '22 by StackOverflowNewbie

2 Answers

Link depth means the number of "hops" a page is away from the root, where a "hop" means following a link on a page. The reason Nutch has this restriction is that links very "far away" from the main page are unlikely to hold much information (the main page links to the most important content, so the farther you go, the more detailed and niche the pages become), while there can be very many of them, so they take up lots of storage space, computing time for ranking, and bandwidth.

Nutch thus uses an algorithm known as depth-limited search to bound its running time and space usage. If it didn't use this heuristic, it would have to crawl an entire site to rank all the pages in it and find the top N.

To crawl to depth 3, implement this algorithm and give it a depth bound of three. The nice thing about depth-limited search is that it's a variant of depth-first search (DFS), so it's quite space-efficient:

function depth-limited-crawl(page p, int d)
    if d == 0
        return
    /* do something with p, store it or so */
    foreach (page l in links(p))
        depth-limited-crawl(l, d-1)
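
For concreteness, here is a minimal runnable Python sketch of the same depth-limited crawl. fetch_links() is a hypothetical placeholder for your own fetch-and-parse step (e.g. an HTTP client plus an HTML parser), and the starting URL is just an example; a real crawler would also need robots.txt handling, politeness delays, and error handling.

def fetch_links(url):
    """Hypothetical helper: download `url` and return the URLs it links to."""
    raise NotImplementedError("plug in your HTTP client / HTML parser here")

def depth_limited_crawl(url, depth, visited=None):
    if visited is None:
        visited = set()              # avoid re-crawling pages reachable via several paths
    if depth == 0 or url in visited:
        return
    visited.add(url)
    print("crawling", url)           # "do something with p": store it, index it, etc.
    for link in fetch_links(url):
        depth_limited_crawl(link, depth - 1, visited)

# Example (placeholder URL): depth_limited_crawl("http://www.domain.com/", 3)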

And no, a site cannot in general be represented as a binary tree; it's a directed graph. If you somehow remove the backlinks, it becomes a multiway tree. Either way, many sites are too large for your crawler to store in full.

answered Feb 11 '23 by Fred Foo


  1. Make a list that you use as a queue.
  2. Append www.domain.com with depth 0 to it.
  3. Pull the first element off the queue.
  4. The current depth is that element's depth + 1.
  5. Crawl that page.
  6. Append each link on the page to the queue (with the current depth) if the current depth isn't greater than the maximum depth.
  7. If the list isn't empty, go back to step 3; a code sketch of this loop follows below.
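
A minimal Python sketch of this queue-based (breadth-first) approach, assuming a hypothetical fetch_links() helper that downloads a page and returns its outgoing links; the starting URL is only an example.

from collections import deque

def fetch_links(url):
    """Hypothetical helper: download `url` and return the URLs it links to."""
    raise NotImplementedError("plug in your HTTP client / HTML parser here")

def crawl_breadth_first(start_url, max_depth):
    queue = deque([(start_url, 0)])       # steps 1-2: queue of (url, depth) pairs
    seen = {start_url}
    while queue:                          # step 7: stop when the queue is empty
        url, depth = queue.popleft()      # step 3: pull the first element
        print("crawling", url)            # step 5: crawl that page
        current_depth = depth + 1         # step 4
        if current_depth > max_depth:     # step 6: don't enqueue links beyond the limit
            continue
        for link in fetch_links(url):
            if link not in seen:          # skip pages already queued or crawled
                seen.add(link)
                queue.append((link, current_depth))

# Example (placeholder URL): crawl_breadth_first("http://www.domain.com/", 3)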
answered Feb 12 '23 by thejh