
How to get a mail address from HTML code with Nokogiri

Tags:

ruby

nokogiri

How can I get the mail address from HTML code with Nokogiri? I'm thinking of using a regex, but I don't know if that's the best solution.

Example code:

<html>
<title>Example</title>
<body>
This is an example text.
<a href="mailto:[email protected]">Mail to me</a>
</body>
</html>

Does a method exist in Nokogiri to get the mail address if it is not between some tags?

jgiunta asked Feb 29 '12 01:02


1 Answer

You can extract the email addresses using xpath.

The selector //a will select any a tags on the page, and you can specify the href attribute using @ syntax, so //a/@href will give you the hrefs of all a tags on the page.

If there is a mix of a tags on the page with different URL types (e.g. http:// URLs), you can use XPath functions to further narrow down the selected nodes. The selector

//a[starts-with(@href, "mailto:")]/@href

will give you the href nodes of all a tags that have an href attribute that starts with "mailto:".

Putting this all together, and adding a little extra code to strip out the "mailto:" from the start of the attribute value:

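require 'nokogiri'

# XPath: href attributes of <a> elements whose href starts with "mailto:"
selector = '//a[starts-with(@href, "mailto:")]/@href'

doc = Nokogiri::HTML.parse(File.read('my_file.html'))

nodes = doc.xpath(selector)

# Drop the leading "mailto:" (7 characters) from each attribute value
addresses = nodes.collect { |n| n.value[7..-1] }

puts addresses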

With a test file that looks like this:

<html>
<title>Example</title>
<body>
This is an example text.
<a href="mailto:[email protected]">Mail to me</a>
<a href="http://example.com">A Web link</a>
<a>An empty anchor.</a>
</body>
</html>

this code outputs the desired [email protected]. The addresses variable is an array of all the email addresses found in mailto links in the document.

matt answered Nov 01 '22 11:11