
preg_replace all links in file_get_contents not containing a word [duplicate]

I'm reading a page into a variable and I would like to disable all links that do not contain the word "remedy" in the address. The code I have so far grabs all the links including ones with "remedy". What am I doing wrong?

$page = preg_replace('~<a href=".*?(?!remedy).*?".*?>(.*?)</a>~i', '<font color="#808080">$1</font>', $page);

-- solution --

$page = preg_replace('~<a href="(.(?!remedy))*?".*?>(.*?)</a>~i', '<font color="#808080">$2</font>', $page);

user2001487 asked Oct 05 '22 03:10


2 Answers

Try ~<a href="(.(?!remedy))*?".*?>(.*?)</a>~i

As for what you are doing wrong: a regex matches whenever there is any possible way to match, and '~<a href=".*?(?!remedy).*?".*?>(.*?)</a>~i' can match every URL, even one containing remedy. The pattern does not say that remedy must not appear anywhere in the attribute. It only says that somewhere in the attribute, after anything or nothing (.*?), there is a position that is not immediately followed by remedy. Every URL has such a position: even in <a href="remedy" the first .*? can consume the r, and what follows is then emedy rather than remedy. I hope that makes sense.
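
To make the point above concrete, here is a minimal sketch that runs the question's original pattern against three sample links. The sample markup and variable names are made up for illustration and are not from the question.

```php
<?php
// The question's original pattern: the lookahead is tested at a single
// position, and the first .*? is free to choose that position.
$original = '~<a href=".*?(?!remedy).*?".*?>(.*?)</a>~i';

// Hypothetical sample links (assumed for this sketch).
$withWord    = '<a href="/remedy/list">Remedies</a>';
$startsWith  = '<a href="remedy.html">Remedy</a>';
$withoutWord = '<a href="/about">About</a>';

// All three match, so preg_replace would disable all three links.
var_dump(preg_match($original, $withWord));    // int(1)
var_dump(preg_match($original, $startsWith));  // int(1)
var_dump(preg_match($original, $withoutWord)); // int(1)
```

In each case the engine only has to find one spot inside the quotes where remedy does not follow, and the rest of the URL is then swallowed by the second .*?.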

Matmarbon answered Oct 13 '22 11:10


I would probably use this:

<a href="(?:(?!remedy)[^"])*"[^>]*>([^<]*)</a>

The most interesting part is this:

"(?:(?!remedy)[^"])*"

Each time the [^"] is about to consume another character, it yields to the lookahead, which confirms that the character is not the first character of the word remedy. Using [^"] instead of . prevents it from looking at anything beyond the closing quote. I also took the liberty of replacing your .*?s with negated character classes. This serves the same purpose, keeping the match "corralled" in the area where you want it to match. It's also more efficient and more robust.
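
As a rough illustration of why it matters that the lookahead runs before each character is consumed, the sketch below applies this pattern and the pattern from the other answer to the same string. The ~ delimiters and the i flag are carried over from the question as an assumption, and the sample markup is invented.

```php
<?php
// Pattern from this answer: the lookahead runs BEFORE each character.
$lookaheadFirst = '~<a href="(?:(?!remedy)[^"])*"[^>]*>([^<]*)</a>~i';
// Pattern from the other answer: the lookahead runs AFTER each character.
$lookaheadAfter = '~<a href="(.(?!remedy))*?".*?>(.*?)</a>~i';

// Hypothetical sample markup (assumed for this sketch).
$page = '<a href="/remedy/1">A</a> <a href="/other" class="x">B</a> <a href="remedy.html">C</a>';

$first = preg_replace($lookaheadFirst, '<font color="#808080">$1</font>', $page);
$after = preg_replace($lookaheadAfter, '<font color="#808080">$2</font>', $page);

echo $first, PHP_EOL;
echo $after, PHP_EOL;
```

With this input, the first result keeps both remedy links and greys out only B. The second result also greys out C, because when remedy sits directly after the opening quote, no consumed character is ever followed by the full word.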

Of course, I'm assuming the <a> element's content is plain text, with no more elements nested inside it. In fact, that's just one of many simplifying assumptions I've made. You can't match HTML with regexes without them.
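
If those assumptions do not hold, for example when links contain nested elements, one alternative is to swap the regex for an HTML parser. The following is only a sketch using PHP's DOMDocument, not something proposed in the question or the answers. The function name disableLinksWithout is hypothetical, and the font element simply mirrors the replacement used in the question.

```php
<?php
// Hypothetical helper: replace every <a> whose href lacks $keep
// with a grey <font> element that holds the link's former content.
function disableLinksWithout(string $html, string $keep): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Copy the live node list first, because replacing nodes mutates it.
    $links = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($links as $a) {
        // Case-insensitive check, like the i flag in the regex versions.
        if (stripos($a->getAttribute('href'), $keep) !== false) {
            continue; // this link mentions the word, so leave it alone
        }
        $font = $doc->createElement('font');
        $font->setAttribute('color', '#808080');
        // Move the link's children (text or nested elements) into <font>.
        while ($a->firstChild !== null) {
            $font->appendChild($a->firstChild);
        }
        $a->parentNode->replaceChild($font, $a);
    }
    // Note: saveHTML() returns a full document (doctype, html, body).
    return $doc->saveHTML();
}

// Usage with hypothetical markup.
echo disableLinksWithout(
    '<p><a href="/remedy/1">keep</a> <a href="/other"><b>drop</b></a></p>',
    'remedy'
);
```

The trade-off is that the parser normalizes the markup and wraps fragments in a full document, so the output is not a byte-for-byte copy of the input with only the links changed.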

Alan Moore answered Oct 13 '22 11:10