I have the following string:
Lorem ipsum Test dolor sit amet, consetetur sadipscing elitr, sed diam nonumy <a href="http://Test.com/url">Test</a> eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd sed Test dolores et ea rebum. Stet clita kasd gubergren, no sea <a href="http://url.com">Test xyz</a> takimata sanctus est Lorem ipsum dolor sit amet.
Now I would like to replace the string 'Test' only where it appears outside of tags and not between tags (e.g. replace it with '1234'):
Lorem ipsum 1234 dolor sit amet, consetetur sadipscing elitr, sed diam nonumy <a href="http://Test.com/url">Test</a> eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd sed 1234 dolores et ea rebum. Stet clita kasd gubergren, no sea <a href="http://url.com">Test xyz</a> takimata sanctus est Lorem ipsum dolor sit amet.
I started with this regex: (?!<a[^>]*>)(Test)([^<])(?!</a>)
But two problems are not solved:
<a href="http://Test.com/url">
<a href="http://url">Test xyz</a>
I hope someone has a solution to this problem.
Answer
Use
(Test)(?!(.(?!<a))*</a>)
Explanation
Let me remind you of the meaning of some symbols:
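1) ?! is a negative lookahead. For example, r(?!d) selects every r that is not directly followed by a d.
2) Therefore never start a negative lookahead without a character. Just (?!d) is meaningless: on its own it selects no characters and only matches the empty positions that are not followed by a d.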
3) The ? can be used as a lazy match. For example, .+E would select the whole string 123EEE from 123EEE. However, .+?E selects as few "any character" (.+) as needed. It would only select 123E.
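The three reminders above can be checked quickly in Python. This is only a minimal sketch using the standard re module; the sample strings "rd ra rr" and "rd" are made up for illustration, while 123EEE is the example from the text.

```python
import re

# 1) Negative lookahead: every "r" that is not directly followed by "d".
sample = "rd ra rr"  # hypothetical sample string
lookahead_hits = [m.start() for m in re.finditer(r"r(?!d)", sample)]

# 2) A lookahead on its own consumes nothing: it only finds empty positions
#    that are not followed by "d".
bare_spans = [m.span() for m in re.finditer(r"(?!d)", "rd")]

# 3) Greedy vs. lazy quantifier on the example from the text.
greedy = re.match(r".+E", "123EEE").group()
lazy = re.match(r".+?E", "123EEE").group()

print(lookahead_hits)  # positions of the matching "r" characters
print(bare_spans)      # zero-width spans only
print(greedy, lazy)
```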
Answer:
protist's answer is that you should use
(?!<a[^>]*?>)(Test)(?![^<]*?</a>)
Let me first explain how to make this shorter.
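As mentioned in 2), it is meaningless to put a lookahead before the match: it is checked at the position where Test begins, and the text there can never start with <a, so it never fails. The following is therefore equivalent to protist's answer: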
(Test)(?![^<]*?</a>)
Also, since [^<] cannot match < and </a> starts with <, greedy and lazy matching stop at the same place. The lazy ? is superfluous, so this is also equivalent to
(Test)(?![^<]*</a>)
This selects every Test that is not followed by a </a> with no < symbol in between. This is why a Test that appears before or after any <a ...> .. </a> will be replaced.
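To make the equivalence concrete, here is a small Python sketch that runs all three variants on a shortened version of the string from the question. The shortened text and the variable names are assumptions for illustration only.

```python
import re

# Shortened version of the string from the question (assumption: the omitted
# filler words do not affect the result).
text = (
    'Lorem ipsum Test dolor sit amet <a href="http://Test.com/url">Test</a> '
    'kasd sed Test dolores <a href="http://url.com">Test xyz</a> takimata'
)

patterns = [
    r'(?!<a[^>]*?>)(Test)(?![^<]*?</a>)',  # protist's original regex
    r'(Test)(?![^<]*?</a>)',               # leading lookahead dropped
    r'(Test)(?![^<]*</a>)',                # lazy ? dropped as well
]

# Apply each variant; all three should produce the same output.
results = [re.sub(p, '1234', text) for p in patterns]

for r in results:
    print(r)
```

Only the two occurrences outside the links are replaced; the one in the href and the two in the link texts stay untouched.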
However, note that a tag nested inside the link puts a < between Test and </a>, so
Lorem Test dolor <a href="http://Test.com/url">Test <strong>dolor</strong></a> eirmod
would be changed to
Lorem 1234 dolor <a href="http://1234.com/url">1234 <strong>dolor</strong></a> eirmod
In order to catch that you could change your regex to
(Test)(?!(.(?!<a))*</a>)
which does the following:
Select every word Test that is not followed by a string ***</a> where each character in *** is not followed by <a.
Note that the dot . is important (see 2)): without it, the inner lookahead (?!<a) would have no character to attach to.
Note that a lazy match like (Test)(?!(.(?!<a))*?</a>) is not relevant because nested links are illegal in HTML4 and HTML5 (something like <a href="#">..<a href="#">...</a>..</a>).
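As a quick check, the following Python sketch applies both the shorter regex and the nested-lookahead regex to the example with the nested <strong> tag. The variable names are made up for illustration.

```python
import re

nested = (
    'Lorem Test dolor <a href="http://Test.com/url">'
    'Test <strong>dolor</strong></a> eirmod'
)

# Shorter regex: any "<" between Test and "</a>" makes the lookahead give up,
# so Test inside the link is replaced as well.
simple = re.sub(r'(Test)(?![^<]*</a>)', '1234', nested)

# Nested lookahead: walk forward character by character, stopping only before
# an opening "<a"; if "</a>" is reached first, Test is inside a link.
# Note: "." does not match newlines by default in Python, so links spanning
# several lines would need the re.DOTALL flag.
fixed = re.sub(r'(Test)(?!(.(?!<a))*</a>)', '1234', nested)

print(simple)
print(fixed)
```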
protist said
Also, using regexes on raw HTML is not recommended.
I agree with that. One problem is that the regex breaks if a tag is not closed or not opened. For example, given a stray </a> with no opening tag, the regex (Test)(?!(.(?!<a))*</a>) would change
Lorem Test dolor Test <strong>dolor</strong></a> eirmod Test dolores sea Test takimata
to
Lorem Test dolor Test <strong>dolor</strong></a> eirmod 1234 dolores sea 1234 takimata
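One way to sidestep these pitfalls is to let an HTML parser locate the text nodes instead of a regex. The following is only a rough sketch using Python's standard html.parser module, not a complete solution: the names OutsideLinkReplacer and replace_outside_links are hypothetical, comments and doctype declarations are silently dropped, and closing tags are written back in lowercase.

```python
from html.parser import HTMLParser


class OutsideLinkReplacer(HTMLParser):
    """Rebuilds the markup, replacing `old` only in text outside <a> elements."""

    def __init__(self, old, new):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.depth = 0   # number of currently open <a> elements
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1
        self.out.append(self.get_starttag_text())  # raw tag, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:  # ignore a stray </a>
            self.depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        inside_link = self.depth > 0
        self.out.append(data if inside_link else data.replace(self.old, self.new))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def replace_outside_links(html, old, new):
    parser = OutsideLinkReplacer(old, new)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(replace_outside_links(
    'Lorem Test dolor</a> eirmod Test <a href="http://Test.com/url">Test</a>',
    "Test", "1234"))
```

Because attribute values never reach handle_data, the Test inside the href is never touched, and a stray </a> no longer protects the text in front of it from being replaced.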