Regular expression to remove <br> from <pre>

Tags: html, c#, regex

I am trying to remove the <br/> tags that appear between the <pre> and </pre> tags. My string looks like this:

string str = "Test<br/><pre><br/>Test<br/></pre><br/>Test<br/>---<br/>Test<br/><pre><br/>Test<br/></pre><br/>Test";

string temp = "`##`";
// Loop until no <br>, <br/> or <br /> is left between a <pre> and a later </pre>.
while (Regex.IsMatch(str, @"<pre>(.*?)<br\s*/?>(.*?)</pre>", RegexOptions.IgnoreCase))
{
    str = Regex.Replace(str, @"<pre>(.*?)<br\s*/?>(.*?)</pre>", "<pre>$1" + temp + "$2</pre>", RegexOptions.IgnoreCase);
}
str = str.Replace(temp, System.Environment.NewLine);

But this replaces every <br/> tag between the first <pre> and the last </pre> in the whole text, including the ones that sit outside the <pre> blocks. So my final outcome is:

str = "Test<br/><pre>\r\nTest\r\n</pre>\r\nTest\r\n---\r\nTest\r\n<pre>\r\nTest\r\n</pre><br/>Test"

I expect my outcome to be

str = "Test<br/><pre>\r\nTest\r\n</pre><br/>Test<br/>---<br/>Test<br/><pre>\r\nTest\r\n</pre><br/>Test"

asked Aug 13 '10 06:08 by Ashish


1 Answer

If you are parsing whole HTML pages, RegEx is not a good choice - see here for a good demonstration of why.

Use an HTML parser such as the HTML Agility Pack for this kind of work. It also works with fragments like the one you posted.
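
As a rough illustration of the parser route, here is a minimal sketch using the HTML Agility Pack (NuGet package `HtmlAgilityPack`). The class and method names (`PreCleaner`, `ReplaceBrInsidePre`) are made up for this example, and the XPath `//pre//br` assumes you want every <br> nested anywhere inside a <pre>. Note that the parser may re-serialize the untouched tags slightly differently from the input (for example <br> instead of <br/>), so check the output against your own data.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party library, installed from NuGet

public static class PreCleaner // hypothetical helper, not part of the library
{
    // Replaces every <br> element inside a <pre> element with a newline
    // and leaves <br> elements outside <pre> blocks in place.
    public static string ReplaceBrInsidePre(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var brNodes = doc.DocumentNode.SelectNodes("//pre//br");
        if (brNodes == null)
        {
            return html;
        }

        // Copy to a list first so the tree can be modified while iterating.
        foreach (var br in brNodes.ToList())
        {
            var newline = doc.CreateTextNode(Environment.NewLine);
            br.ParentNode.ReplaceChild(newline, br);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `PreCleaner.ReplaceBrInsidePre(str)` on the string from the question should only touch the <br/> tags inside the two <pre> blocks, because the XPath never selects nodes outside them.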

answered Sep 22 '22 02:09 by Oded