
regular expressions - match all anchors with optional attributes

Tags: regex, php

I have a WYSIWYG editor in my back end, and its output is tripping up the first regular expression I wrote. This is in PHP 4, using preg_replace(). I'm capturing the URI and the link text.

@<a\shref=\"http[s]?://([^\"]*)\"[]>(.*)<\/a>@siU

The client wanted all external links to open in a new window, so that's the expression I was using to find (hopefully) all external links while leaving internal links, page anchor links, etc. untouched.

I've since realised that the WYSIWYG editor also adds style="font-weight: bold" to the link if the user selects bold. I've only recently started learning regular expressions, so I'm unsure how to handle this.

How would I do it?

asked Oct 27 '08 01:10 by alex



1 Answer

This should match it:

/<a\s+([^>]*)href="https?:\/\/([^"]*)"(.*?)>(.*?)<\/a>/
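For illustration, here is a minimal sketch of running that pattern through preg_match_all() on some editor-style markup. The function name extract_external_links() and the sample HTML are made up for this example. Note that the captured URI (group 2) does not include the http:// or https:// prefix, because that part sits outside the capture group.

```php
<?php
// Hypothetical helper: returns the URI and link text of every external link.
function extract_external_links($html)
{
    // The pattern from the answer, unchanged.
    $pattern = '/<a\s+([^>]*)href="https?:\/\/([^"]*)"(.*?)>(.*?)<\/a>/';

    $links = array();
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // $m[1] and $m[3] hold any attributes before and after href.
            $links[] = array('uri' => $m[2], 'text' => $m[4]);
        }
    }
    return $links;
}

// Made-up sample input: a bold external link, an internal page anchor,
// and an external link whose style attribute comes before href.
$html = '<p><a href="http://example.com/page" style="font-weight: bold">Example</a>, '
      . '<a href="#top">Top</a>, '
      . '<a style="font-weight: bold" href="https://example.org/">Org</a></p>';

$links = extract_external_links($html);
foreach ($links as $link) {
    echo $link['uri'], ' => ', $link['text'], "\n";
}
```

The page anchor is skipped because its href does not start with http:// or https://. If the link text can span several lines, the s modifier would be needed so that . also matches newlines; that is an assumption about the editor's output, not something the pattern as posted handles.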

The useful thing in this pattern is the lazy match, *?. It matches only as much as it absolutely needs to, as opposed to the regular *, which is greedy and matches as much as it can.

To demonstrate, with this text:

a b c d a b c d

these regexes will have different results:

/a.*c/    selects: "a b c d a b c"
/a.*?c/   selects: "a b c"
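The same comparison can be checked from PHP with preg_match(). This is just a small sketch; the variable names are arbitrary.

```php
<?php
$text = 'a b c d a b c d';

// Greedy: .* runs to the end of the string, then backtracks to the last "c".
preg_match('/a.*c/', $text, $greedy);

// Lazy: .*? expands one character at a time and stops at the first "c".
preg_match('/a.*?c/', $text, $lazy);

echo $greedy[0], "\n"; // a b c d a b c
echo $lazy[0], "\n";   // a b c
```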
answered Sep 27 '22 16:09 by nickf