Regular expression for finding the 'href' value of an <a> link

Tags:

c#

regex

I need a regex pattern for finding web page links in HTML.

I first use @"(<a.*?>.*?</a>)" to extract the links (<a>), but I can't get the href value from that.

My strings are:

  1. <a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
  2. <a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
  3. <a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
  4. <a href="www.example.com/page.php/404" ....></a>

1, 2 and 3 are valid and I need them, but number 4 is not valid for me (? and = are essential).


Thanks everyone, but I don't need to parse <a>. I have a list of links in href="abcdef" format.

I need to fetch the href of each link and filter it: the URLs I want must contain ? and =, like page.php?id=5.

Thanks!

asked Apr 10 '13 12:04 by MrRolling



2 Answers

I'd recommend using an HTML parser over a regex, but here's a regex that creates a capturing group over the value of the href attribute of each link. It matches whether double or single quotes are used. The quote character is captured in group 1 and the href value in group 2.

<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1 

You can view a full explanation of this regex here.
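Since the question is tagged c#, here is a minimal sketch of how the pattern above might be applied with System.Text.RegularExpressions, keeping only the href values that contain both ? and = as the question asks. The class name HrefFilter, the method name QueryHrefs, and the sample input are assumptions made up for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefFilter
{
    // Pattern from the answer: group 1 is the quote character, group 2 the href value.
    // Inside a C# verbatim string, a double quote is written as "".
    private static readonly Regex LinkRx = new Regex(
        @"<a\s+(?:[^>]*?\s+)?href=([""'])(.*?)\1");

    // Hypothetical helper: returns the href values that contain both '?' and '='.
    public static List<string> QueryHrefs(string html)
    {
        var result = new List<string>();
        foreach (Match m in LinkRx.Matches(html))
        {
            string href = m.Groups[2].Value;
            if (href.Contains("?") && href.Contains("="))
            {
                result.Add(href);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Sample input modelled on the links listed in the question.
        string html =
            "<a href=\"http://www.example.com/page.php?id=xxxx&name=yyyy\">one</a>" +
            "<a href=\"www.example.com/page.php/404\">two</a>";

        foreach (string href in QueryHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

With the sample input, only the first URL is printed; the /404 link is dropped because it contains neither ? nor =.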

Snippet playground:

const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');

document.querySelector('button').addEventListener('click', () => {
  console.log(textToMatchInput.value.match(linkRx));
});

<label>
  Text to match:
  <input type="text" name="textToMatch" value='<a href="google.com"'>
  <button>Match</button>
</label>

answered Sep 23 '22 06:09 by plalx


Using regex to parse HTML is not recommended.

Regex is meant for regularly occurring patterns. HTML is not regular in its format (except XHTML). For example, HTML files are valid even if you don't have a closing tag! This could break your code.
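To make that concrete, here is a small sketch showing two inputs that are ordinary HTML but slip past the patterns from this page: link text placed on its own line, and spaces around the = of the attribute. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexLimits
{
    // The pattern from the question.
    public static bool QuestionPatternMatches(string html) =>
        Regex.IsMatch(html, @"(<a.*?>.*?</a>)");

    // The href pattern from the first answer.
    public static bool HrefPatternMatches(string html) =>
        Regex.IsMatch(html, @"<a\s+(?:[^>]*?\s+)?href=([""'])(.*?)\1");

    public static void Main()
    {
        string oneLine = "<a href=\"page.php?id=5\">five</a>";
        string multiLine = "<a href=\"page.php?id=5\">\nfive\n</a>"; // link text on its own line
        string spaced = "<a href = \"page.php?id=5\">five</a>";     // spaces around '='

        Console.WriteLine(QuestionPatternMatches(oneLine));   // True
        Console.WriteLine(QuestionPatternMatches(multiLine)); // False: '.' does not match line breaks by default
        Console.WriteLine(HrefPatternMatches(oneLine));       // True
        Console.WriteLine(HrefPatternMatches(spaced));        // False: the pattern expects "href=" with no spaces
    }
}
```

Each case can be patched individually (RegexOptions.Singleline, \s* around the =), but the list of such cases keeps growing, which is the argument for a parser.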

Use an HTML parser like HtmlAgilityPack.

You can use this code to retrieve the href values of all anchor tags using HtmlAgilityPack:

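using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// Note: SelectNodes returns null (not an empty list) when the document has no <a> elements.
var hrefList = doc.DocumentNode.SelectNodes("//a")
                  .Select(p => p.GetAttributeValue("href", "not found"))
                  .ToList();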

hrefList contains all the href values.
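The question also asks to keep only the URLs that contain ? and =. A minimal sketch of that filter on top of HtmlAgilityPack could look like the following. It assumes the HTML is already in a string (hence LoadHtml rather than Load), selects only anchors that actually have an href, and guards against SelectNodes returning null. The names ParsedHrefFilter and QueryHrefs are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class ParsedHrefFilter
{
    // Hypothetical helper: returns href values that contain both '?' and '='.
    public static List<string> QueryHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so fall back to an empty sequence.
        IEnumerable<HtmlNode> anchors =
            doc.DocumentNode.SelectNodes("//a[@href]") ?? Enumerable.Empty<HtmlNode>();

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.Contains("?") && href.Contains("="))
            .ToList();
    }

    public static void Main()
    {
        // Sample input modelled on the links in the question.
        string html =
            "<a href=\"page.php?id=5\">one</a>" +
            "<a href=\"page.php/404\">two</a>";

        foreach (string href in QueryHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Here the second link is filtered out because it has no query string, and a document with no links at all yields an empty list instead of a NullReferenceException.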

answered Sep 23 '22 06:09 by Anirudha