
Regex to extract hyperlink containing a specific word

Tags:

regex

I need to extract a hyperlink, containing a specific word in the URL, from a piece of text. Example:

"This is a text with a link to some page. Click this link <a href="/server/specificword.htm">this is a link to a page</a> to see that page. Here is a link that doesn't have the word "specificword" in it: <a href="/server/mypage.htm">this is a link without the word "specificword" in the url</a>"

So, I need to parse this text, check the hyperlinks to see if one of them contains the word "specificword", and then extract the entire hyperlink. I would then end up with this:

<a href="/server/specificword.htm">this is a link to a page</a>

I need the hyperlink that has specificword in the URL (e.g. /server/specificword.htm), not in the link text.

One regex I have tried is this one: /(<a[^>]*>.*?</a>)|specificword/ This matches every hyperlink in the text, or the bare word "specificword". If the text has multiple links without the word "specificword", I get those too.
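To make the behaviour described above concrete, here is a minimal Python sketch. The sample text is an assumption modeled on the question (with well-formed href quotes), and Python's built-in re module stands in for whatever engine the asker used. It shows that the alternation returns both links plus the bare word, because each branch of `|` is tried independently.

```python
import re

# Assumed sample text, modeled on the question (href quotes closed).
text = (
    'This is a text with a link to some page. Click this link '
    '<a href="/server/specificword.htm">this is a link to a page</a> '
    'to see that page. Here is a link without the word "specificword": '
    '<a href="/server/mypage.htm">a link without the word in the url</a>'
)

# The first attempt from the question: "any link" OR "the bare word".
pattern = re.compile(r'(<a[^>]*>.*?</a>)|specificword')

# Collect the full text of every match, in order of appearance.
matches = [m.group(0) for m in pattern.finditer(text)]

for found in matches:
    print(found)
```

The second branch never restricts the first one, so the word has to be moved inside the part of the pattern that matches the href value.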

Also, I have tried this one, but it matches nothing:

<a.*?href\s*=\s*["\']([^"\'>]*specificword[^"\'>]*)["\'][^>]*>.*?<\/a>

My regex skills end here; any help would be great.

Soeren asked Apr 19 '13 08:04


2 Answers

Try this to match the whole a tag:

/<a [^>]*\bhref\s*=\s*"[^"]*SPECIFICWORD.*?<\/a>/

or just for the link (in the first capture group):

/<a [^>]*\bhref\s*=\s*"([^"]*SPECIFICWORD[^"]*)/

If you use PHP, for the link:

preg_match_all('/<a [^>]*\bhref\s*=\s*"\K[^"]*SPECIFICWORD[^"]*/', $text, $results);
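For readers not using PHP, the two patterns above can be sketched in Python. This is an illustration under assumptions: the sample text and the names WORD, FULL_TAG and URL_ONLY are made up for the example. Python's built-in re module does not support \K, so the capture-group variant is used to get the URL.

```python
import re

WORD = "specificword"  # hypothetical search word, escaped before use

# Whole <a ...>...</a> element whose href value contains WORD.
FULL_TAG = re.compile(
    r'<a [^>]*\bhref\s*=\s*"[^"]*' + re.escape(WORD) + r'.*?</a>'
)

# Only the href value, returned through the first capture group.
URL_ONLY = re.compile(
    r'<a [^>]*\bhref\s*=\s*"([^"]*' + re.escape(WORD) + r'[^"]*)'
)

# Assumed sample text, modeled on the question (href quotes closed).
text = (
    'This is a text with a link to some page. Click this link '
    '<a href="/server/specificword.htm">this is a link to a page</a> '
    'to see that page. Here is a link without the word "specificword": '
    '<a href="/server/mypage.htm">a link without the word in the url</a>'
)

full_tags = FULL_TAG.findall(text)  # no groups: list of whole matches
urls = URL_ONLY.findall(text)       # one group: list of group contents

print(full_tags)
print(urls)
```

Because `[^"]*` cannot cross a double quote, the word must appear inside the href value itself, which is why the second link and the bare word in the prose are skipped.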
Casimir et Hippolyte answered Oct 19 '22 07:10


This one should suit your needs:

<a href="[^"]*?specificword.*?">.*?</a>

Demo


If you want to allow other attributes on your anchor tag, and be more permissive about inner spaces, you could try:

<a( [^>]*?)? href="[^"]*?specificword.*?"( .*?)?>.*?</a>

Demo
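As a rough check of the difference between the two patterns so far, the following Python sketch runs both against an anchor that carries extra attributes. The test string is an assumption, not taken from the question.

```python
import re

# First pattern: href must directly follow "<a " and be the last attribute.
STRICT = re.compile(r'<a href="[^"]*?specificword.*?">.*?</a>')

# Second pattern: optional attributes before and after href.
LOOSE = re.compile(
    r'<a( [^>]*?)? href="[^"]*?specificword.*?"( .*?)?>.*?</a>'
)

# Hypothetical anchor with attributes on both sides of href.
html = '<a class="nav" href="/server/specificword.htm" target="_blank">a page</a>'

strict_match = STRICT.search(html)
loose_match = LOOSE.search(html)

print(strict_match)          # the strict pattern finds nothing here
print(loose_match.group(0))  # the loose pattern matches the whole anchor
```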


You could also of course use non-capturing groups (?:...):

<a(?: [^>]*?)? href="[^"]*?specificword.*?"(?: .*?)?>.*?</a>

Demo
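One practical reason to prefer non-capturing groups shows up in APIs that return groups instead of the whole match. The sketch below uses Python's re.findall as an example (other engines and functions behave differently); the test string is an assumption.

```python
import re

CAPTURING = re.compile(
    r'<a( [^>]*?)? href="[^"]*?specificword.*?"( .*?)?>.*?</a>'
)
NON_CAPTURING = re.compile(
    r'<a(?: [^>]*?)? href="[^"]*?specificword.*?"(?: .*?)?>.*?</a>'
)

# Hypothetical anchor with attributes on both sides of href.
html = '<a class="nav" href="/server/specificword.htm" target="_blank">a page</a>'

# With capturing groups, findall returns a tuple of the groups per match.
with_groups = CAPTURING.findall(html)

# Without capturing groups, findall returns the full matched text.
without_groups = NON_CAPTURING.findall(html)

print(with_groups)
print(without_groups)
```

Both patterns match the same text; only what the groups expose to the caller changes.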


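And finally, if you want to allow single quotes for your href attribute (a backreference is not interpreted as such inside a character class, so [^\1] would not exclude the opening quote; a negative lookahead (?!\1) does that instead):

<a(?: [^>]*?)? href=(["'])(?:(?!\1).)*?specificword(?:(?!\1).)*?\1(?: .*?)?>.*?</a>

Demo


Last but not least: if you want to capture the URL, just put parentheses around the part between the quotes:

<a(?: [^>]*?)? href=(["'])((?:(?!\1).)*?specificword(?:(?!\1).)*?)\1(?: .*?)?>.*?</a>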

Demo
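A minimal Python sketch of capturing the URL with either quote style follows. It is a variant rather than the pattern exactly as posted: in Python's re, a numeric escape inside a character class is read as a character code, not a backreference, so `[^\1]` would not stop at the opening quote. The sketch swaps it for the negative lookahead `(?:(?!\1).)`, which consumes one character at a time as long as it is not the opening quote. The name LINK and the test string are assumptions.

```python
import re

# Group 1: the opening quote. Group 2: the URL containing the word.
LINK = re.compile(
    r'''<a(?: [^>]*?)? href=(["'])((?:(?!\1).)*?specificword(?:(?!\1).)*?)\1(?: .*?)?>.*?</a>'''
)

# Hypothetical input: single-quoted match, non-matching link, double-quoted match.
html = (
    "<a href='/server/specificword.htm'>single</a> "
    '<a href="/server/mypage.htm">other</a> '
    '<a class="x" href="/docs/specificword/index.htm">double</a>'
)

# Collect the captured URL of every matching anchor.
urls = [m.group(2) for m in LINK.finditer(html)]

print(urls)
```

The middle link is skipped because the lookahead keeps the search for the word inside its own href value instead of running on into the next anchor.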

sp00m answered Oct 19 '22 09:10