I'm not sure how I'd select a title with regex. I've tried
match(/<title>(.*) .*<\/title>/)[1]
but that doesn't match anything.
This is the response body I'm trying to select from.
I'm trying to select "title I need to select."
It doesn't work because of the itemprop="name" attribute: your pattern requires a literal <title>, but the tag in the page is <title itemprop="name">. To fix this, allow for attributes in the opening tag:
# copy-paste from the page you provided
html = '<!doctype html>\n<html lang=\"en\" itemscope itemtype=\"https://schema.org/WebPage\">\n<head>\n<meta charset=\"utf-8\"><meta name=\"referrer\" content=\"always\" />\n<title itemprop=\"name\">title I need to select.</title>\n<meta itemprop=\"description\" name=\"description\" content=\\'
html.match(/<title.*?>(.*)<\/title>/)[1] # => "title I need to select."
.*? is a lazy (non-greedy) quantifier: it matches as few characters as needed, so here it stops at the first > that lets the rest of the pattern match.
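To make the difference between the greedy and lazy quantifier concrete, here is a small sketch. The one-line `tag` string is a shortened stand-in for the page in the question, not the full response body, and the variable names are only illustrative.

```ruby
# Shortened stand-in for the relevant line of the page (an assumption for illustration).
tag = '<title itemprop="name">title I need to select.</title>'

# Greedy: .* runs as far as it can, so the match ends at the LAST ">" in the string.
greedy_match = tag[/<title.*>/]

# Lazy: .*? stops at the FIRST ">", which is the end of the opening tag.
lazy_match = tag[/<title.*?>/]

# The pattern from the question needs a literal "<title>", so it finds nothing.
original = tag.match(/<title>(.*) .*<\/title>/)

# String#[] with a capture index returns nil instead of raising when nothing matches.
fixed = tag[/<title.*?>(.*)<\/title>/, 1]

puts greedy_match
puts lazy_match
puts original.inspect
puts fixed
```

Using `tag[regex, 1]` rather than `tag.match(regex)[1]` avoids a NoMethodError on nil when the pattern does not match.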
However, as others have pointed out, regexes are not ideal for HTML parsing. Instead, you could use a popular Ruby gem for that purpose, Nokogiri:
require 'nokogiri'
page = Nokogiri.parse(html)
page.css('title').text # => "title I need to select."
Note that it can handle even malformed HTML, as is the case here.
If you're looking for a much more robust XML/HTML parser, try Nokogiri, which supports XPath.
This post explains why: "Use XPath or Regex?"
require "nokogiri"
string = "<title itemprop=\"name\">title I need to select.</title>"
html_doc = Nokogiri::HTML(string)
html_doc.xpath("//title").first.text