
DRY search every page of a site with nokogiri

I want to search every page of a site. My plan is to find all the links on a page that stay within the domain, visit them, and repeat. I'll also need a way to make sure I don't visit the same page twice.

So it starts very easily:

require 'nokogiri'
require 'open-uri'

page = 'http://example.com'
nf = Nokogiri::HTML(URI.open(page))

links = nf.xpath '//a' #find all links on current page

main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq 

"main_links" is now an array of links from the active page that start with "/" (which should be links on the current domain only).

From here I can feed those links back into the same sort of code above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I'll start collecting all the visited links as I visit them:

visited_links = [] # array of what we've visited, kept outside the loop

main_links.each do |ml|
  np = Nokogiri::HTML(URI.open(page + ml)) # load the next main_link
  visited_links.push(ml)                   # record the page we're on
  np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq # links on this page pointing to the current domain
  main_links.concat(np_links).uniq! # remove duplicates after pushing?
end

I'm still working out this last bit... but does this seem like the proper approach?

Thanks.

asked Dec 16 '22 by twinturbotom


1 Answer

Others have advised you not to write your own web crawler. I agree with this if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:

"[…] but I don't know the best way to ensure I don't repeat myself"

Recursion is the key here. Something like the following code:

require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'

def crawl_site( starting_at, &each_page )
  files = %w[png jpeg jpg gif svg txt js css zip gz]
  starting_uri = URI.parse(starting_at)
  seen_pages = Set.new                      # Keep track of what we've seen

  crawl_page = ->(page_uri) do              # A re-usable mini-function
    unless seen_pages.include?(page_uri)
      seen_pages << page_uri                # Record that we've seen this
      begin
        doc = Nokogiri.HTML(page_uri.open) # Get the page (URI#open comes from open-uri)
        each_page.call(doc,page_uri)        # Yield page and URI to the block

        # Find all the links on the page
        hrefs = doc.css('a[href]').map{ |a| a['href'] }

        # Make these URIs, throwing out problem ones like mailto:
        uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact

        # Pare it down to only those pages that are on the same site
        uris.select!{ |uri| uri.host == starting_uri.host }

        # Throw out links to files (this could be more efficient with regex)
        uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }

        # Remove #foo fragments so that sub-page links aren't differentiated
        uris.each{ |uri| uri.fragment = nil }

        # Recursively crawl the child URIs
        uris.each{ |uri| crawl_page.call(uri) }

      rescue OpenURI::HTTPError # Guard against 404s
        warn "Skipping invalid link #{page_uri}"
      end
    end
  end

  crawl_page.call( starting_uri )   # Kick it all off!
end

crawl_site('http://phrogz.net/') do |page,uri|
  # page here is a Nokogiri HTML document
  # uri is a URI instance with the address of the page
  puts uri
end
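As the comment on the file-extension check suggests, that test could also be a single regex match instead of files.any?; a possible variant (file_ext is just an illustrative name, and the pattern mirrors the same extension list):

file_ext = /\.(?:png|jpe?g|gif|svg|txt|js|css|zip|gz)\z/

# Same intent as the files.any? check above, but one match per URI
uris.reject!{ |uri| uri.path =~ file_ext }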

In short:

  • Keep track of which pages you've seen using a Set. Do this not by href value, but by the full canonical URI.
  • Use URI.join to turn possibly-relative paths into the correct URI with respect to the current page (see the short sketch below).
  • Use recursion to keep crawling every link on every page, but bail out if you've already seen the page.
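A tiny sketch of the URI.join and fragment points, with made-up example.com paths:

require 'uri'

base = URI.parse('http://example.com/articles/index.html')

URI.join(base, 'post-1.html')  #=> #<URI::HTTP http://example.com/articles/post-1.html>
URI.join(base, '/about')       #=> #<URI::HTTP http://example.com/about>

# Dropping the fragment keeps "#comments" from looking like a separate page
uri = URI.join(base, 'post-1.html#comments')
uri.fragment = nil
uri.to_s                       #=> "http://example.com/articles/post-1.html"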
answered Dec 22 '22 by Phrogz