
How can I get all links of a website using the Mechanize gem?

Tags: ruby, mechanize

How can I get all the links of a website using the Ruby Mechanize gem? Can Mechanize do something like what the Anemone gem does here:

Anemone.crawl("https://www.google.com.vn/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

I'm new to web crawling. Thanks in advance!

1Rhino asked Sep 11 '14 07:09


1 Answer

It's quite simple with Mechanize, and I suggest reading the documentation. The Bastards Book of Ruby is a good place to start.

To get all the links from a page with Mechanize, try this:

require 'mechanize'

agent = Mechanize.new
page = agent.get("http://example.com")
# Print the text and target of every link on the page
page.links.each { |link| puts "#{link.text} => #{link.href}" }

The code should be self-explanatory. page is a Mechanize::Page object that stores the whole content of the retrieved page, and Mechanize::Page has a links method that returns every link found on it.
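The snippet above only covers a single page. To get something closer to what Anemone does for a whole site, one option is to follow the links yourself. Here is a rough sketch, not a built-in Mechanize feature: crawl_site and max_pages are names I made up for illustration, and it assumes you only want pages on the same host as the start URL. The agent argument can be a Mechanize instance, or anything else that responds to get and returns a page with links.

```ruby
require 'set'
require 'uri'

# Breadth-first crawl limited to the host of start_url.
# crawl_site and max_pages are hypothetical names, not part of Mechanize.
# Returns the list of URLs that were fetched successfully, in visit order.
def crawl_site(agent, start_url, max_pages: 50)
  start   = URI.parse(start_url)
  queue   = [start.to_s]
  seen    = Set.new(queue)
  visited = []

  until queue.empty? || visited.size >= max_pages
    url = queue.shift
    begin
      page = agent.get(url)
    rescue StandardError
      next # skip pages that fail to load (HTTP errors, timeouts, ...)
    end

    visited << url
    yield url if block_given?
    next unless page.respond_to?(:links) # non-HTML responses have no links

    page.links.each do |link|
      next if link.href.nil?
      begin
        target = URI.join(url, link.href) # resolve relative hrefs
      rescue URI::Error
        next # malformed href
      end
      next unless %w[http https].include?(target.scheme)
      next unless target.host == start.host

      target.fragment = nil # treat /page#a and /page#b as the same page
      queue << target.to_s if seen.add?(target.to_s)
    end
  end

  visited
end

# Usage (needs the mechanize gem and network access):
#   require 'mechanize'
#   crawl_site(Mechanize.new, "http://example.com/") { |url| puts url }
```

Relative links are resolved against the URL that was requested, so redirects are not taken into account. A real crawler would probably also want to respect robots.txt and add a delay between requests.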

Mechanize is very powerful, but remember that if you only want to scrape, without any interaction with the website (forms, logins, sessions), Nokogiri is enough. Mechanize itself uses Nokogiri to parse the pages it retrieves, so for scraping alone you can use Nokogiri directly.

MonkTools answered Oct 13 '22 10:10