
Regex to replace all \n in a String, but not those inside [code][/code] tags

Tags: java, regex

I need help replacing all \n (newline) characters with <br /> in a String, but not those \n inside [code][/code] tags. My brain is burning, I can't solve this on my own :(

Example:

test test test
test test test
test
test

[code]some
test
code
[/code]

more text

Should be:

test test test<br />
test test test<br />
test<br />
test<br />
<br />
[code]some
test
code
[/code]<br />
<br />
more text<br />

Thanks for your time. Best regards.

Tute asked Nov 30 '08 03:11


2 Answers

I would suggest a (simple) parser, and not a regular expression. Something like this (bad pseudocode):

stack elementStack;
output = "";

foreach(position in string) {
    // string-from-position: the text starting at the current position
    if(string-from-position startsWith "[code]") {
        elementStack.push("code");
    }

    if(string-from-position startsWith "[/code]") {
        elementStack.popTo("code");
    }

    char = string[position];
    if(char == "\n" && !elementStack.contains("code")) {
        output += "<br />\n";
    } else {
        output += char;
    }
}
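
Since the question is tagged java, here is a minimal sketch of the same idea in Java. It is not the answerer's code: the class name CodeAwareBreaks and the method convert are made up for illustration, and a depth counter stands in for the element stack because only one kind of tag is tracked. It assumes lines end with a plain \n and that a stray [/code] should simply be copied through.

```java
/** Sketch: replaces '\n' with "<br />\n", except inside [code]...[/code]. */
final class CodeAwareBreaks {
    private static final String OPEN = "[code]";
    private static final String CLOSE = "[/code]";

    static String convert(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        int depth = 0; // number of [code] tags currently open (stands in for the stack)
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(OPEN, i)) {
                depth++;
                out.append(OPEN);
                i += OPEN.length();
            } else if (input.startsWith(CLOSE, i)) {
                if (depth > 0) {
                    depth--; // assumption: a stray closing tag is ignored
                }
                out.append(CLOSE);
                i += CLOSE.length();
            } else {
                char c = input.charAt(i++);
                if (c == '\n' && depth == 0) {
                    out.append("<br />\n"); // outside code: emit a break, keep the newline
                } else {
                    out.append(c); // inside code, or an ordinary character
                }
            }
        }
        return out.toString();
    }
}
```

Calling CodeAwareBreaks.convert(text) on the example from the question should leave the lines between [code] and [/code] untouched and put <br /> before every other newline. Because the counter only returns to zero when every [code] has been closed, nested blocks are handled as well.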

strager answered Nov 11 '22 23:11


You've tagged the question regex, but this may not be the best tool for the job.

You might be better off using basic compiler-building techniques (i.e. a lexer feeding a simple state machine parser).

Your lexer would identify five tokens: ("[code]", '\n', "[/code]", EOF, :all other strings:) and your state machine looks like:

state    token    action
------------------------
begin    :none:   --> out
out      [code]   OUTPUT(token), --> in
out      \n       OUTPUT(break), OUTPUT(token)
out      *        OUTPUT(token)
in       [/code]  OUTPUT(token), --> out
in       *        OUTPUT(token)
*        EOF      --> end
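
The table above could be sketched in Java roughly as follows. This is only an illustration under some assumptions: the names BreakStateMachine, lex and convert are hypothetical, the break is rendered as <br /> as in the question, and EOF is represented implicitly by reaching the end of the token list rather than as a token of its own.

```java
import java.util.ArrayList;
import java.util.List;

final class BreakStateMachine {
    private enum State { OUT, IN }

    private static final String OPEN = "[code]";
    private static final String CLOSE = "[/code]";
    private static final String NEWLINE = "\n";

    /** Lexer: splits the input into "[code]", "[/code]", "\n" and runs of other text. */
    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            String special = null;
            for (String candidate : new String[] { OPEN, CLOSE, NEWLINE }) {
                if (input.startsWith(candidate, i)) {
                    special = candidate;
                    break;
                }
            }
            if (special == null) {
                text.append(input.charAt(i++)); // part of an "all other strings" token
                continue;
            }
            if (text.length() > 0) {
                tokens.add(text.toString()); // flush the pending text run
                text.setLength(0);
            }
            tokens.add(special);
            i += special.length();
        }
        if (text.length() > 0) {
            tokens.add(text.toString());
        }
        return tokens; // EOF is implicit: the end of the list
    }

    /** Parser: walks the tokens and applies the transition table. */
    static String convert(String input) {
        StringBuilder out = new StringBuilder();
        State state = State.OUT; // begin --> out
        for (String token : lex(input)) {
            if (state == State.OUT) {
                if (token.equals(OPEN)) {
                    state = State.IN; // out, [code] --> in
                } else if (token.equals(NEWLINE)) {
                    out.append("<br />"); // OUTPUT(break)
                }
            } else if (token.equals(CLOSE)) {
                state = State.OUT; // in, [/code] --> out
            }
            out.append(token); // every row of the table ends with OUTPUT(token)
        }
        return out.toString();
    }
}
```

For example, BreakStateMachine.lex("a\n[code]b\n[/code]") yields the tokens "a", "\n", "[code]", "b", "\n", "[/code]", and convert on the same input only adds a break before the first newline.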

EDIT: I see other posters discussing the possible need for nesting the blocks. This state machine won't handle that. For nested blocks, use a recursive descent parser (not quite so simple, but still easy enough and extensible).

EDIT: Axeman notes that this design excludes the use of "[/code]" in the code. An escape mechanism can be used to get around this. For example, add '\' to your tokens and add:

state    token    action
------------------------
in       \        -->esc-in
esc-in   *        OUTPUT(token), -->in
out      \        -->esc-out
esc-out  *        OUTPUT(token), -->out

to the state machine.
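
A rough Java sketch of the state machine with the two escape states added might look like this. The names EscapingBreakStateMachine, nextToken and convert are hypothetical. To keep it short, the lexer is folded into the loop and every non-special character is treated as its own token, which produces the same output as lexing runs of text. It also assumes that a trailing backslash with nothing after it is silently dropped.

```java
final class EscapingBreakStateMachine {
    private enum State { OUT, IN, ESC_OUT, ESC_IN }

    private static final String[] SPECIAL = { "[code]", "[/code]", "\n", "\\" };

    /** Returns the token starting at index i: a special token, or else a single character. */
    private static String nextToken(String input, int i) {
        for (String s : SPECIAL) {
            if (input.startsWith(s, i)) {
                return s;
            }
        }
        return input.substring(i, i + 1);
    }

    static String convert(String input) {
        StringBuilder out = new StringBuilder();
        State state = State.OUT;
        int i = 0;
        while (i < input.length()) {
            String token = nextToken(input, i);
            i += token.length();
            switch (state) {
                case ESC_OUT: // esc-out, * : OUTPUT(token), --> out
                    out.append(token);
                    state = State.OUT;
                    break;
                case ESC_IN: // esc-in, * : OUTPUT(token), --> in
                    out.append(token);
                    state = State.IN;
                    break;
                case OUT:
                    if (token.equals("\\")) {
                        state = State.ESC_OUT; // the backslash itself is not output
                        break;
                    }
                    if (token.equals("[code]")) {
                        state = State.IN;
                    } else if (token.equals("\n")) {
                        out.append("<br />"); // OUTPUT(break)
                    }
                    out.append(token);
                    break;
                case IN:
                    if (token.equals("\\")) {
                        state = State.ESC_IN; // the backslash itself is not output
                        break;
                    }
                    if (token.equals("[/code]")) {
                        state = State.OUT;
                    }
                    out.append(token);
                    break;
            }
        }
        return out.toString();
    }
}
```

With this, an escaped \[/code] inside a code block is copied through as a literal [/code] and the block stays open. One consequence of the design is that a literal backslash in the text has to be written as \\.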

The usual arguments in favor of machine-generated lexers and parsers apply.

dmckee --- ex-moderator kitten answered Nov 11 '22 22:11