I need a regex pattern for finding web page links in HTML.
I first use @"(<a.*?>.*?</a>)" to extract links (<a>), but I can't fetch the href from that.
My strings are:
<a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="www.example.com/page.php/404" ....></a>
Numbers 1, 2 and 3 are valid and I need them, but number 4 is not valid for me (? and = are essential).
Thanks everyone, but I don't need to parse <a>. I have a list of links in href="abcdef" format.
I need to fetch the href of each link and filter it: the URLs I want must contain ? and =, like page.php?id=5
Thanks!
Definition and Usage. The href attribute specifies the URL of the page the link goes to. If the href attribute is not present, the <a> tag will not be a hyperlink. Tip: You can use href="#top" or href="#" to link to the top of the current page!
The attribute value of href (inside the quotation marks) is a URL that tells the browser where to go when the link is selected. Note the additional attributes target="_blank" and rel="noopener": target="_blank" tells the browser to open the web page in a new tab, and rel="noopener" stops the new page from reaching the opening page through window.opener.
(?i) makes the regex case insensitive. (?s), for "single line mode", makes the dot match all characters, including line breaks.
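In JavaScript these two options are normally written as the i and s flags after the closing slash rather than as inline groups. A minimal sketch of what each flag changes, using made-up sample strings (html and text are assumptions, not data from the question):

```javascript
// Hypothetical input: an upper-case tag with a line break before the attribute.
const html = '<A\nHREF="page.php?id=5">link</A>';

// Without the i flag, "a" does not match "A".
const plain = /<a\s+href="(.*?)"/;
// With the i flag the match is case insensitive.
const insensitive = /<a\s+href="(.*?)"/i;

console.log(plain.test(html));          // false
console.log(insensitive.exec(html)[1]); // page.php?id=5

// Hypothetical input: link text that spans two lines.
const text = '<a href="x">first\nsecond</a>';

// Without the s flag, "." stops at the line break, so no match is found.
const noDotAll = /<a.*?>(.*?)<\/a>/;
// With the s flag (dotAll), "." also matches line breaks.
const dotAll = /<a.*?>(.*?)<\/a>/s;

console.log(noDotAll.test(text)); // false
console.log(dotAll.test(text));   // true
```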
I'd recommend using an HTML parser over a regex, but still, here's a regex that will create a capturing group over the value of the href attribute of each link. It will match whether double or single quotes are used.
<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1
You can view a full explanation of this regex here.
Snippet playground:
const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');
document.querySelector('button').addEventListener('click', () => {
  console.log(textToMatchInput.value.match(linkRx));
});
<label> Text to match: <input type="text" name="textToMatch" value='<a href="google.com"'> <button>Match</button> </label>
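The playground only reports the first match. To collect every href value from a longer string, the same pattern can be given the g flag and passed to matchAll. This is a sketch under assumptions: extractHrefs and linkRxAll are hypothetical names, and the sample markup is invented for illustration.

```javascript
// Same pattern as in the answer, with the g flag so matchAll can visit every link.
const linkRxAll = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/g;

// extractHrefs is a hypothetical helper name, not part of any library.
function extractHrefs(html) {
  // Group 1 is the quote character, group 2 is the href value.
  return Array.from(html.matchAll(linkRxAll), (m) => m[2]);
}

// Invented sample: one double-quoted href, one single-quoted href after another attribute.
const sample = [
  '<a href="www.example.com/page.php?id=xxxx&name=yyyy">one</a>',
  "<a class='nav' href='www.example.com/page.php/404'>two</a>",
].join('\n');

console.log(extractHrefs(sample));
```

The result is a plain array of strings, so any further filtering can be done without touching the regex again.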
Using regex to parse html is not recommended. regex is used for regularly occurring patterns, and html is not regular in its format (except xhtml). For example, html files are valid even if you don't have a closing tag! This could break your code.
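To make the risk concrete, here is a small sketch of the pattern from the question run against markup where the first anchor is never closed, which browsers will still render. The input string is invented for illustration, and anchorRx is a hypothetical name for the question's pattern rewritten as a JavaScript regex.

```javascript
// The pattern from the question, as a JavaScript regex with the g flag.
const anchorRx = /<a.*?>.*?<\/a>/g;

// Hypothetical markup: the first anchor has no closing tag.
const broken = '<a href="a.php?id=1">one <a href="b.php?id=2">two</a>';

const matches = broken.match(anchorRx);

// Two links go in, but the lazy ".*?" runs on to the only "</a>" it can find,
// so both links end up inside a single match.
console.log(matches.length);        // 1
console.log(matches[0] === broken); // true
```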
Use an html parser like HtmlAgilityPack. You can use this code to retrieve all hrefs in anchor tags using HtmlAgilityPack:
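using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// Note: SelectNodes returns null when no node matches, so check for that on real input.
var hrefList = doc.DocumentNode.SelectNodes("//a")
    .Select(p => p.GetAttributeValue("href", "not found"))
    .ToList();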
hrefList contains all the href values.
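Once the href values are in a list, the filter the question asks for is plain string work and needs no HTML handling at all. A minimal sketch in JavaScript, assuming that "must contain ? and =" means a ? followed by at least one key=value pair; hasQueryPair is a hypothetical helper name, and the hrefs array reuses the four strings from the question.

```javascript
// hasQueryPair is a hypothetical helper: true when the href has a "?"
// followed by at least one key=value pair.
function hasQueryPair(href) {
  const q = href.indexOf('?');
  if (q === -1) return false;
  // Drop any #fragment, then look at the query string only.
  const query = href.slice(q + 1).split('#')[0];
  // A pair counts when it has a non-empty key before "=".
  return query.split('&').some((pair) => /^[^=]+=/.test(pair));
}

// The four href values from the question.
const hrefs = [
  'www.example.com/page.php?id=xxxx&name=yyyy',
  'http://www.example.com/page.php?id=xxxx&name=yyyy',
  'https://www.example.com/page.php?id=xxxx&name=yyyy',
  'www.example.com/page.php/404',
];

const wanted = hrefs.filter(hasQueryPair);
console.log(wanted.length); // 3
```

Checking the position of ? before looking for = avoids accepting a URL where = appears only in the path or the fragment.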