
Regex replace text but exclude when text is between specific tag

Tags:

regex

I have the following string:

Lorem ipsum Test dolor sit amet, consetetur sadipscing elitr, sed diam nonumy <a href="http://Test.com/url">Test</a> eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd sed Test dolores et ea rebum. Stet clita kasd gubergren, no sea <a href="http://url.com">Test xyz</a> takimata sanctus est Lorem ipsum dolor sit amet.

Now I would like to replace the string 'Test' where it appears outside of tags, but not inside a tag or between an opening and closing tag (e.g. replace it with '1234'). The expected result:

Lorem ipsum 1234 dolor sit amet, consetetur sadipscing elitr, sed diam nonumy <a href="http://Test.com/url">Test</a> eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd sed 1234 dolores et ea rebum. Stet clita kasd gubergren, no sea <a href="http://url.com">Test xyz</a> takimata sanctus est Lorem ipsum dolor sit amet.

I started with this regex: (?!<a[^>]*>)(Test)([^<])(?!</a>)

But two problems remain unsolved:

  1. The text 'Test' is also replaced inside tags (e.g. <a href="http://Test.com/url">)
  2. If the text between the tags does not exactly match the searched text, it is also replaced (e.g. <a href="http://url">Test xyz</a>)
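
Both problems can be reproduced with a minimal Python sketch. Python is an assumption here (the question does not name a language), and the sample string is shortened for readability:

```python
import re

# Shortened version of the sample string (an assumption made for readability).
text = ('Lorem ipsum Test dolor <a href="http://Test.com/url">Test</a> eirmod '
        'sed Test dolores <a href="http://url.com">Test xyz</a> takimata')

first_try = re.compile(r'(?!<a[^>]*>)(Test)([^<])(?!</a>)')

# Group 2 consumes the character after 'Test', so it is put back explicitly.
result = first_try.sub(r'1234\g<2>', text)
print(result)
```

The 'Test' inside the href and the 'Test' in 'Test xyz' are both replaced; only the link text that is exactly 'Test' survives.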

I hope someone has a solution to solve this problem.

Weri asked Sep 19 '12 10:09


1 Answer

Answer

Use

(Test)(?!(.(?!<a))*</a>)

Explanation

Let me remind you of the meaning of some symbols:

1) ?! is a negative lookahead. For example, r(?!d) selects every r that is not directly followed by a d.

2) Therefore, never use a negative lookahead without a character next to it. A lookahead consumes no characters, so (?!d) on its own only matches empty positions and selects no text.

3) The ? can be used to make a match lazy. For example, .+E applied to

123EEE

selects the whole string 123EEE. However, .+?E selects as few "any character" matches (.+) as needed, so it would only select 123E.
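
The three points above can be checked with a short sketch. Python's re module is an assumption (the thread does not name a regex engine), and the test string 'word bar card r' is made up for illustration:

```python
import re

# 1) r(?!d): every "r" that is not directly followed by "d".
r_positions = [m.start() for m in re.finditer(r'r(?!d)', 'word bar card r')]

# 2) A bare lookahead consumes nothing: every match is the empty string.
bare_matches = re.findall(r'(?!d)', 'ad')

# 3) Greedy versus lazy quantifier on the string from the text.
greedy = re.match(r'.+E', '123EEE').group()
lazy = re.match(r'.+?E', '123EEE').group()

print(r_positions, bare_matches, greedy, lazy)
```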

Answer:

protist's answer is that you should use (?!<a[^>]*?>)(Test)(?![^<]*?</a>). Let me first explain how to make this shorter.

As mentioned in 2), the lookahead placed before the match has no effect: at the position where Test starts, the text can never begin with <a. So the following is equivalent to protist's answer:

(Test)(?![^<]*?</a>)

Also, since [^<] never matches <, the lazy ? is superfluous, so it is also equivalent to

(Test)(?![^<]*</a>)

This selects every Test that is not followed by a </a> with no < in between. This is why a Test that appears before or after any <a ...> .. </a> will be replaced.
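
A quick way to see that the three variants behave the same on the question's input is to run them side by side. This is a sketch in Python (an assumed language) on a shortened version of the sample string:

```python
import re

# Shortened version of the question's sample string (an assumption).
text = ('Lorem ipsum Test dolor <a href="http://Test.com/url">Test</a> eirmod '
        'sed Test dolores <a href="http://url.com">Test xyz</a> takimata')

variants = [
    r'(?!<a[^>]*?>)(Test)(?![^<]*?</a>)',  # protist's answer
    r'(Test)(?![^<]*?</a>)',               # leading lookahead dropped
    r'(Test)(?![^<]*</a>)',                # lazy ? dropped as well
]

results = [re.sub(pattern, '1234', text) for pattern in variants]
print(results[0])
```

All three variants produce the same output here, and only the two occurrences outside the links are replaced.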

However, note that when the link contains a nested tag, this regex fails. For example,

Lorem Test dolor <a href="http://Test.com/url">Test <strong>dolor</strong></a> eirmod

would be changed to

Lorem 1234 dolor <a href="http://1234.com/url">1234 <strong>dolor</strong></a> eirmod 
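
The over-replacement can be reproduced with a small sketch (Python is an assumed language). The nested <strong> puts a < between Test and </a>, so the lookahead never sees the closing tag:

```python
import re

nested = ('Lorem Test dolor <a href="http://Test.com/url">'
          'Test <strong>dolor</strong></a> eirmod')

# [^<]* stops at "<strong", so "</a>" is never found and every Test is replaced.
over_replaced = re.sub(r'(Test)(?![^<]*</a>)', '1234', nested)
print(over_replaced)
```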

To handle that case, you could change the regex to

(Test)(?!(.(?!<a))*</a>)

which does the following:

Select every word Test that is not followed by a string ***</a> where each character in *** is not followed by <a.

Note that the dot . is important (see 2)).

Note that a lazy match like (Test)(?!(.(?!<a))*?</a>) is not needed, because nested links are illegal in HTML4 and HTML5 (something like <a href="#">..<a href="#">...</a>..</a>).
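
As a sketch of how this regex behaves, here it is in Python (an assumed language), applied to the nested-tag example and to a shortened version of the question's sample string:

```python
import re

final = re.compile(r'(Test)(?!(.(?!<a))*</a>)')

nested = ('Lorem Test dolor <a href="http://Test.com/url">'
          'Test <strong>dolor</strong></a> eirmod')
# Shortened version of the question's sample string (an assumption).
sample = ('Lorem ipsum Test dolor <a href="http://Test.com/url">Test</a> eirmod '
          'sed Test dolores <a href="http://url.com">Test xyz</a> takimata')

fixed_nested = final.sub('1234', nested)
fixed_sample = final.sub('1234', sample)
print(fixed_nested)
print(fixed_sample)
```

In both strings only the occurrences outside the links are replaced. Note that in Python the dot does not match a newline by default, so this sketch assumes each link sits on a single line.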

protist said

Also, using regexes on raw HTML is not recommended.

I agree with that. One problem is that the regex misbehaves if a tag is not opened or not closed. For example, the final regex above would change

Lorem Test dolor Test <strong>dolor</strong></a> eirmod Test dolores sea Test takimata

to

Lorem Test dolor Test <strong>dolor</strong></a> eirmod 1234 dolores sea 1234 takimata
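
The unopened-tag case can be sketched as follows (Python is an assumed language). The input is assumed to end with 'eirmod Test dolores sea Test takimata', which is what the output above implies:

```python
import re

final = re.compile(r'(Test)(?!(.(?!<a))*</a>)')

# Stray </a> without a matching <a ...>; the trailing part is an assumption.
broken = ('Lorem Test dolor Test <strong>dolor</strong></a> '
          'eirmod Test dolores sea Test takimata')

result = final.sub('1234', broken)
print(result)
```

Everything before the stray </a> is treated as link content and left alone, even though no link was ever opened.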
Adam answered Sep 21 '22 14:09