<p>I need a regex pattern for finding web page links in HTML.</p> <p>I first used <code>@"(<a.*?>.*?</a>)"</code> to extract the links (<code><a></code>), but I can't get the <code>href</code> value from that.</p> <p>My strings are:</p> <ol> <li><code><a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a></code></li> <li><code><a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a></code></li> <li><code><a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a></code></li> <li><code><a href="www.example.com/page.php/404" ....></a></code></li> </ol> <p>Numbers 1, 2 and 3 are valid and I need them, but number 4 is not valid for me (<code>?</code> and <code>=</code> are essential).</p> <hr> <p>Thanks everyone, but I don't need to parse <code><a></code> tags. I have a list of links in <code>href="abcdef"</code> format.</p> <p>I need to fetch the <code>href</code> of each link and filter it: the URLs I want must contain <code>?</code> and <code>=</code>, like <code>page.php?id=5</code>.</p> <p>Thanks!</p>

<p>I'd recommend using an HTML parser over a regex, but here is a regex that creates a capturing group over the value of the <code>href</code> attribute of each link. It matches whether double or single quotes are used: group 1 captures the opening quote and group 2 captures the <code>href</code> value.</p> <pre class="prettyprint"><code><a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1 </code></pre> <p>You can view a full explanation of this regex here.</p> <p>Snippet playground:</p> <p></p> <div class="snippet" data-lang="js" data-hide="false" data-console="true" data-babel="false"> <div class="snippet-code"> <pre class="prettyprint snippet-code-js lang-js prettyprint-override"><code>const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/; const textToMatchInput = document.querySelector('[name=textToMatch]'); document.querySelector('button').addEventListener('click', () => { console.log(textToMatchInput.value.match(linkRx)); });</code></pre> <pre class="prettyprint snippet-code-html lang-html prettyprint-override"><code><label> Text to match: <input type="text" name="textToMatch" value='<a href="google.com"'> <button>Match</button> </label></code></pre> </div> </div>
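The playground above only shows the first match and does not apply the filter the question asks for (keep only `href` values that contain `?` and `=`). Here is a minimal sketch of that extra step, written in JavaScript to match the playground. It reuses the same pattern with the `g` flag so that every link is found. The helper names `extractHrefs` and `hasQueryParam` and the sample markup are illustrative and not part of the original answer. The sketch also assumes that "contains `?` and `=`" means an `=` appearing somewhere after the first `?`.

```javascript
// Same pattern as in the answer, plus the "g" flag so all links are matched.
// Group 1 is the opening quote, group 2 is the href value.
const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/g;

// Hypothetical helper: collect the href value of every <a ...> tag in the text.
function extractHrefs(html) {
  return Array.from(html.matchAll(linkRx), (m) => m[2]);
}

// Hypothetical helper: true when the href has a "?" followed later by an "=".
function hasQueryParam(href) {
  const q = href.indexOf('?');
  return q !== -1 && href.indexOf('=', q + 1) !== -1;
}

// Sample input modelled on the four strings listed in the question.
const sample = [
  '<a href="www.example.com/page.php?id=xxxx&name=yyyy">1</a>',
  '<a href="http://www.example.com/page.php?id=xxxx&name=yyyy">2</a>',
  '<a href="https://www.example.com/page.php?id=xxxx&name=yyyy">3</a>',
  '<a href="www.example.com/page.php/404">4</a>',
].join('\n');

// Keeps links 1 to 3 and drops link 4, which has no query string.
const wanted = extractHrefs(sample).filter(hasQueryParam);
console.log(wanted);
```

The pattern uses no JavaScript-specific syntax, so the same two steps (collect group 2 of every match, then filter) should carry over to other regex engines, although quoting and escaping rules differ by language.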

<p>Using <code>regex</code> to parse HTML is not recommended.</p> <p><code>regex</code> is meant for regularly occurring patterns, and <code>html</code> is not regular in its format (except <code>xhtml</code>). For example, <code>html</code> files can be valid even if you <strong>don't</strong> have a <code>closing tag</code>! This could break your code.</p> <p>Use an HTML parser like HtmlAgilityPack.</p> <p>You can use this code to retrieve the <code>href</code> values of all anchor tags using <code>HtmlAgilityPack</code>:</p> <pre class="prettyprint"><code>// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// SelectNodes returns null (not an empty collection) when nothing matches.
var anchors = doc.DocumentNode.SelectNodes("//a");
var hrefList = anchors == null
    ? new string[0]
    : anchors.Select(p => p.GetAttributeValue("href", "not found")).ToArray();
</code></pre> <p><code>hrefList</code> then contains all the <code>href</code> values; anchors without an <code>href</code> attribute yield <code>"not found"</code>.</p>

regular expression for finding 'href' value of a <a> link

2 Answers


145

answered Sep 23 '22 06:09

plalx


answered Sep 23 '22 06:09

Anirudha


Tags:

c#

regex

MrRolling
