I want to search every page of a site. My thought is to find all links on a page that stay within the domain, visit them, and repeat. I'll have to implement measures to not repeat efforts as well.
So it starts very easily:
page = 'http://example.com'
nf = Nokogiri::HTML(open(page))
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
"main_links" is now an array of links from the active page that start with "/" (which should be links on the current domain only).
From here I can feed and read those links into similar code above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I start collecting all the visited links as I visit them:
main_links.each do |ml|
visited_links = [] #new array of what is visted
np = Nokogiri::HTML(open(page + ml)) #load the first main_link
visted_links.push(ml) #push the page we're on
np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq #grab all links on this page pointing to the current domain
main_links.push(np_links).compact.uniq #remove duplicates after pushing?
end
I'm still working out this last bit... but does this seem like the proper approach?
Thanks.
Others have advised you not to write your own web crawler. I agree with this if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:
"[…] but I don't know the best way to ensure I don't repeat myself"
Recursion is the key here. Something like the following code:
require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'
def crawl_site( starting_at, &each_page )
files = %w[png jpeg jpg gif svg txt js css zip gz]
starting_uri = URI.parse(starting_at)
seen_pages = Set.new # Keep track of what we've seen
crawl_page = ->(page_uri) do # A re-usable mini-function
unless seen_pages.include?(page_uri)
seen_pages << page_uri # Record that we've seen this
begin
doc = Nokogiri.HTML(open(page_uri)) # Get the page
each_page.call(doc,page_uri) # Yield page and URI to the block
# Find all the links on the page
hrefs = doc.css('a[href]').map{ |a| a['href'] }
# Make these URIs, throwing out problem ones like mailto:
uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact
# Pare it down to only those pages that are on the same site
uris.select!{ |uri| uri.host == starting_uri.host }
# Throw out links to files (this could be more efficient with regex)
uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }
# Remove #foo fragments so that sub-page links aren't differentiated
uris.each{ |uri| uri.fragment = nil }
# Recursively crawl the child URIs
uris.each{ |uri| crawl_page.call(uri) }
rescue OpenURI::HTTPError # Guard against 404s
warn "Skipping invalid link #{page_uri}"
end
end
end
crawl_page.call( starting_uri ) # Kick it all off!
end
crawl_site('http://phrogz.net/') do |page,uri|
# page here is a Nokogiri HTML document
# uri is a URI instance with the address of the page
puts uri
end
In short:
Set
. Do this not by href
value, but by the full canonical URI.URI.join
to turn possibly-relative paths into the correct URI with respect to the current page.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With