How can I get the mail address from HTML code with Nokogiri? I'm thinking of using a regex, but I don't know if that's the best solution.
Example code:
<html>
<title>Example</title>
<body>
This is an example text.
<a href="mailto:[email protected]">Mail to me</a>
</body>
</html>
Does a method exist in Nokogiri to get the mail address if it is not between some tags?
You can extract the email addresses using XPath.
The selector //a will select any a tags on the page, and you can specify the href attribute using the @ syntax, so //a/@href will give you the hrefs of all a tags on the page.
If there is a mix of a tags on the page with different URL types (e.g. http:// URLs), you can use XPath functions to further narrow down the selected nodes. The selector
//a[starts-with(@href, "mailto:")]/@href
will give you the href nodes of all a tags that have an href attribute that starts with "mailto:".
Putting this all together, and adding a little extra code to strip out the "mailto:" from the start of the attribute value:
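require 'nokogiri'

# Select the href attribute of every <a> whose href starts with "mailto:"
selector = '//a[starts-with(@href, "mailto:")]/@href'
doc = Nokogiri::HTML.parse(File.read('my_file.html'))
nodes = doc.xpath(selector)
# Drop the 7-character "mailto:" prefix from each attribute value
addresses = nodes.collect { |n| n.value[7..-1] }
puts addresses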
With a test file that looks like this:
<html>
<title>Example</title>
<body>
This is an example text.
<a href="mailto:[email protected]">Mail to me</a>
<a href="http://example.com">A Web link</a>
<a>An empty anchor.</a>
</body>
</html>
this code outputs the desired [email protected]. Here, addresses is an array of all the email addresses found in mailto links in the document.
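One caveat with slicing off the first seven characters: a mailto link may carry extra fields after a "?", such as a subject line, and n.value[7..-1] would keep them. The sketch below shows one way that could be handled with plain string methods. The helper name address_from_mailto and the sample hrefs are made up for illustration; they are not part of Nokogiri or of the test file above, and the helper does not decode percent-escapes.

```ruby
# Hypothetical helper (not part of Nokogiri): returns the address part of a
# mailto: href, or nil if the value is not a mailto link or has no address.
def address_from_mailto(href)
  # Accept "mailto:" in any letter case
  return nil unless href.downcase.start_with?('mailto:')

  # Drop the 7-character "mailto:" prefix, then anything from "?" onwards
  address = href[7..-1].split('?', 2).first.to_s.strip
  address.empty? ? nil : address
end

# Made-up sample values standing in for the href strings of selected nodes
hrefs = [
  'mailto:someone@example.com?subject=Hello',
  'MAILTO:other@example.com',
  'http://example.com',
  'mailto:'
]

# compact removes the nil results for non-mailto or empty links
addresses = hrefs.map { |h| address_from_mailto(h) }.compact
puts addresses
```

With Nokogiri, the same helper could be applied as nodes.map { |n| address_from_mailto(n.value) }.compact. Note that the XPath starts-with test is case-sensitive, so an href written as "MAILTO:" would not be selected by the selector in the first place.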