Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove simple HTML-Tags from String in Oracle via RegExp, Explanation needed

I do not understand, why my columns reg1 and reg2 remove "bbb" from my string, and only reg3 works as expected.

WITH t AS (SELECT 'aaa <b>bbb</b> ccc' AS teststring FROM dual)

SELECT
  teststring,
  regexp_replace(teststring, '<.+>') AS reg1,
  regexp_replace(teststring, '<.*>') AS reg2,
  regexp_replace(teststring, '<.*?>') AS reg3
FROM t


TESTSTRING             REG1        REG2          REG3
aaa <b>bbb</b> ccc     aaa ccc     aaa ccc       aaa bbb ccc

Thanks a lot!

like image 293
Basti Avatar asked Jun 10 '15 12:06

Basti


1 Answers

Because regex is greedy by default. I.e. the expressions .* or .+ try to take as many characters as possible. Therefore <.+> will span from the first < to the last >. Make it lazy by using the lazy operator ?:

regexp_replace(teststring, '<.+?>')

or

regexp_replace(teststring, '<.*?>')

Now, the search for > will stop at the first > encountered.

Note that . includes > as well, therefore the greedy variant (without ?) swallows all the > but the last.

like image 143
Olivier Jacot-Descombes Avatar answered Sep 22 '22 02:09

Olivier Jacot-Descombes