What regex will match text excluding what lies within HTML tags?

Tags: c#, regex

I am writing code for a search results page that needs to highlight search terms. The terms happen to occur within table cells (the app is iterating through GridView Row Cells), and these table cells may have HTML.

Currently, my code looks like this (relevant hunks shown below):

const string highlightPattern = @"<span class=""Highlight"">$0</span>";
DataBoundLiteralControl litCustomerComments = (DataBoundLiteralControl)e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Controls[0];

// Turn "term1 term2" into "(term1|term2)"
string spaceDelimited = txtTextFilter.Text.Trim();
string pipeDelimited = string.Join("|", spaceDelimited.Split(new[] {" "}, StringSplitOptions.RemoveEmptyEntries));
string searchPattern = "(" + pipeDelimited + ")";

// Highlight search terms in Customer - Comments column
e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Text = Regex.Replace(litCustomerComments.Text, searchPattern, highlightPattern, RegexOptions.IgnoreCase);

Amazingly it works. BUT, sometimes the text I am matching on is HTML that looks like this:

<span class="CustomerName">Fred</span> was a classy individual.

And if you search for "class" I want the highlight code to wrap the "class" in "classy" but of course not the HTML attribute "class" that happens to be in there! If you search for "Fred", that should be highlighted.

So what's a good regex that will make sure matches happen only OUTSIDE the HTML tags? It doesn't have to be super hardcore. Simply making sure the match is not between < and > would work fine, I think.

Chris asked Oct 07 '08 18:10


2 Answers

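This regex should do the job:

(?<!<[^>]*)(Fred|span)

Here (Fred|span) stands in for whatever pattern you want to match. The negative lookbehind (?<!<[^>]*) asserts that <[^>]* cannot be matched ending immediately before the candidate match. In other words, looking backward from the match, there is no < that has not yet been closed by a >. This relies on variable-length lookbehind, which the .NET regex engine supports.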

Modified code below:

const string notInsideBracketsRegex = @"(?<!<[^>]*)";
const string highlightPattern = @"<span class=""Highlight"">$0</span>";
DataBoundLiteralControl litCustomerComments = (DataBoundLiteralControl)e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Controls[0];

// Turn "term1 term2" into "(term1|term2)"
string spaceDelimited = txtTextFilter.Text.Trim();
string pipeDelimited = string.Join("|", spaceDelimited.Split(new[] {" "}, StringSplitOptions.RemoveEmptyEntries));
string searchPattern = "(" + pipeDelimited + ")";
searchPattern = notInsideBracketsRegex + searchPattern;

// Highlight search terms in Customer - Comments column
e.Row.Cells[CUSTOMERCOMMENTS_COLUMN].Text = Regex.Replace(litCustomerComments.Text, searchPattern, highlightPattern, RegexOptions.IgnoreCase);
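
To try the lookbehind outside of a GridView, here is a minimal self-contained sketch. The class name TagAwareHighlighter and its Highlight method are hypothetical names, not part of the original code. It also makes two changes the code above does not: each term is passed through Regex.Escape so that input such as "c++" is matched literally, and an empty search box returns the text unchanged instead of building an empty alternation.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original page code.
public static class TagAwareHighlighter
{
    // Negative lookbehind from the answer: fails when an unclosed '<' lies to the left.
    private const string NotInsideTag = @"(?<!<[^>]*)";
    private const string HighlightPattern = @"<span class=""Highlight"">$0</span>";

    public static string Highlight(string html, string searchText)
    {
        // Assumption: terms are separated by spaces, as in the question.
        // Each term is escaped so characters such as '+' or '(' match literally.
        string[] terms = (searchText ?? string.Empty)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape)
            .ToArray();

        // With no terms the alternation would be empty and match at every position.
        if (terms.Length == 0)
        {
            return html;
        }

        string searchPattern = NotInsideTag + "(" + string.Join("|", terms) + ")";
        return Regex.Replace(html, searchPattern, HighlightPattern, RegexOptions.IgnoreCase);
    }
}
```

Calling TagAwareHighlighter.Highlight on the sample text from the question with the term "class" wraps only the "class" in "classy" and leaves the attribute alone. The check is only as good as the markup: a literal < in the text that is not encoded as &lt;, or a > inside an attribute value, would mislead the lookbehind.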

Julien Hoarau answered Nov 15 '22 01:11


You can use a regex with balancing groups and backreferences, but I strongly recommend that you use a parser here.
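
A real HTML parser is not in the .NET base class library, so the sketch below is not one. It is a lighter middle ground, offered only as an illustration: split the text into tag and non-tag pieces, then highlight the non-tag pieces only. The class name SegmentHighlighter is hypothetical, and the tag pattern <[^>]*> is a simplifying assumption that breaks on a > inside an attribute value, on comments, and on script blocks.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; a tokenizing sketch, not an HTML parser.
public static class SegmentHighlighter
{
    public static string Highlight(string html, string searchText)
    {
        string[] terms = (searchText ?? string.Empty)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape)
            .ToArray();
        if (terms.Length == 0)
        {
            return html;
        }

        var termRegex = new Regex(string.Join("|", terms), RegexOptions.IgnoreCase);

        // The capturing group makes Regex.Split keep the tags in the result,
        // so the array alternates: text, tag, text, tag, ..., text.
        string[] pieces = Regex.Split(html, @"(<[^>]*>)");

        var result = new StringBuilder();
        for (int i = 0; i < pieces.Length; i++)
        {
            bool isTag = i % 2 == 1; // odd indexes hold the captured tags
            result.Append(isTag
                ? pieces[i]
                : termRegex.Replace(pieces[i], "<span class=\"Highlight\">$0</span>"));
        }
        return result.ToString();
    }
}
```

Because the search terms never see the tag text at all, this variant does not depend on lookbehind support. For markup that is not under your control, a dedicated parser remains the safer choice.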

Santiago Palladino answered Nov 14 '22 23:11