When building a web spider, should you use recursion?

Tags:

ruby

I'm building a depth-first web spider: it visits all the links on the first page, follows each one, then visits all the links on each second-level page, and so on.

Should you use recursion? I find this to be CPU intensive.

def recurse(page)
  # links_on(page) is assumed to return every link found on the page
  links_on(page).each do |link|
    recurse(link)
  end
end

recurse(first_page)
asked Nov 25 '09 by pgh
1 Answer

Definitely not; you're going to run into problems very quickly because of the actual nature of the world wide web. The second you hit a site with a main navigation section, where each page links to every other page, you've entered an infinite loop.

You could keep track of which links you have already handled, but even then a recursive traversal doesn't really fit the nature of the world wide web (although at first thought it seems to, the web is more of an actual "web" than a tree). You're better off finding all the links on the current page, adding any you haven't seen before to a central queue, and working through the queue iteratively, processing each link as you come to it (remember to keep track of links you've finished processing, or you'll add them to the end of the queue again).
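
For illustration, here's a minimal Ruby sketch of that queue-based approach. It assumes the Nokogiri gem is available for HTML parsing, and links_on is a hypothetical helper introduced here, not part of any standard API:

require 'set'
require 'net/http'
require 'uri'
require 'nokogiri'   # third-party gem, assumed to be installed

# Hypothetical helper: fetch a page and return the absolute URLs it links to.
def links_on(url)
  html = Net::HTTP.get(URI(url))
  Nokogiri::HTML(html).css('a[href]').map do |a|
    URI.join(url, a['href']).to_s rescue nil   # skip malformed hrefs
  end.compact
rescue StandardError
  []   # treat unreachable pages as having no links
end

def crawl(start_url)
  queue   = [start_url]
  visited = Set.new([start_url])

  until queue.empty?
    url = queue.shift                    # take the next link off the front
    links_on(url).each do |link|
      next if visited.include?(link)     # already queued or processed
      visited << link
      queue << link                      # append unseen links to the end
    end
  end

  visited
end

crawl('https://example.com')

Pulling from the front of the queue with shift gives breadth-first order; popping from the end instead would give depth-first order, with no recursion and no risk of blowing the stack.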

answered Sep 30 '22 by LorenVS