
Using regex to get title

I'm not sure how I'd select a title with regex. I've tried

match(/<title>(.*) .*<\/title>/)[1]

but that doesn't match anything.

This is the response body I'm trying to select from.

Trying to select "title I need to select."

user3579614 asked Feb 16 '26 21:02



2 Answers

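The reason it doesn't work is the itemprop="name" attribute: your pattern expects a bare <title> tag, so it never matches <title itemprop="name">. To fix this, allow for attributes inside the opening tag as well: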

# copy-paste from the page you provided
html = '<!doctype html>\n<html lang=\"en\" itemscope itemtype=\"https://schema.org/WebPage\">\n<head>\n<meta charset=\"utf-8\"><meta name=\"referrer\" content=\"always\" />\n<title itemprop=\"name\">title I need to select.</title>\n<meta itemprop=\"description\" name=\"description\" content=\\'

html.match(/<title.*?>(.*)<\/title>/)[1] # => "title I need to select."

.*? basically means "match as many characters as are needed, but no more" (a lazy, or non-greedy, match).
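To make the greedy vs. lazy difference concrete, here is a minimal sketch in plain Ruby. The two_titles string is a made-up input (it is not from the page in the question), and the [^>]* / (.*?) variant is just one possible tightening of the pattern above, not the only way to write it. String#[] with a regex and a capture index is used because it returns nil when nothing matches, instead of raising like match(...)[1] does.

```ruby
# The <title> line from the question's response body (quotes unescaped).
line = '<title itemprop="name">title I need to select.</title>'

# The asker's pattern needs a bare "<title>", so the attribute makes it fail.
asker_result = line[/<title>(.*) .*<\/title>/, 1]

# Hypothetical input with two titles on one line, to expose greediness.
two_titles = '<title itemprop="name">first</title><title>second</title>'

# Greedy capture: (.*) runs on to the LAST </title> on the line.
greedy_result = two_titles[/<title.*?>(.*)<\/title>/, 1]

# Lazy capture: (.*?) stops at the FIRST </title>; [^>]* skips any attributes.
lazy_result = two_titles[/<title[^>]*>(.*?)<\/title>/, 1]

p asker_result   # => nil
p greedy_result  # => "first</title><title>second"
p lazy_result    # => "first"
```

For a page with a single <title> on one line, both versions give the same result; the lazy capture only matters when the closing tag can appear more than once on the same line.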


However, as others have pointed out, regexes are not ideal for HTML parsing. Instead, you could use a popular Ruby gem for that purpose, Nokogiri:

require 'nokogiri'

page = Nokogiri.parse(html)
page.css('title').text # => "title I need to select."

Note that Nokogiri can handle even malformed HTML, as is the case here.

ndnenkov answered Feb 19 '26 12:02



If you're looking for something much more robust than a regex, try Nokogiri, an XML/HTML parser that also supports XPath.

This post explains why: "Use XPath or Regex?"

require "nokogiri"
string = "<title itemprop=\"name\">title I need to select.</title>"
html_doc = Nokogiri::HTML(string)
html_doc.xpath("//title").first.text
jeremy04 answered Feb 19 '26 12:02
