
Highlight a list of words using a regular expression in c#

Tags: c#, regex

I have some site content that contains abbreviations. I have a list of recognised abbreviations for the site, along with their explanations. I want to create a regular expression which will allow me to replace all of the recognised abbreviations found in the content with some markup.

For example:

content:

This is just a little test of the memb to see if it gets picked up. 
Deb of course should also be caught here.

abbreviations:

memb = Member; deb = Debut; 

result:

This is just a little test of the [a title="Member"]memb[/a] to see if it gets picked up. 
[a title="Debut"]Deb[/a] of course should also be caught here.

(This is just example markup for simplicity).

Thanks.

EDIT:

CraigD's answer is nearly there, but there are issues. I only want to match whole words. I also want to keep the correct capitalisation of each word replaced, so that deb is still deb, and Deb is still Deb as per the original text. For example, this input:

This is just a little test of the memb. 
And another memb, but not amemba. 
Deb of course should also be caught here.deb!
Asked by David Conlisk, Dec 02 '22 08:12


1 Answer

First, pass each abbreviation through Regex.Escape() so that any regex metacharacters it contains are matched literally.

Then you can look for each abbreviation in the content and replace it, one abbreviation at a time, with the markup you have in mind:

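// Requires: using System.Text.RegularExpressions;
// "input" is assumed to hold the site content.
string abbr       = "memb";
string word       = "Member";
// Verbatim string (@"...") so that \b is a regex word boundary, not a backspace character.
string pattern    = String.Format(@"\b{0}\b", Regex.Escape(abbr));
// $0 re-inserts the matched text, so "Deb" stays "Deb" and "deb" stays "deb".
string substitute = String.Format("[a title=\"{0}\"]$0[/a]", word);
string output     = Regex.Replace(input, pattern, substitute, RegexOptions.IgnoreCase);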

EDIT: I asked if a simple String.Replace() wouldn't be enough - but I can see why regex is desirable: you can use it to enforce "whole word" replacements only by making a pattern that uses word boundary anchors.

You can go as far as building a single pattern from all your escaped input strings, like this:

\b(?:{abbr_1}|{abbr_2}|{abbr_3}|{abbr_n})\b

and then using a match evaluator to find the right replacement. This way you can avoid iterating the input string more than once.
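The single-pattern approach could look roughly like the sketch below. The class name AbbreviationHighlighter, the method Highlight and the dictionary of abbreviations are made up for illustration; only Regex.Escape, Regex.Replace and the MatchEvaluator overload come from the framework. It assumes every abbreviation starts and ends with a word character, because \b would not behave as intended next to punctuation such as a trailing dot.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, named here only for the example.
public static class AbbreviationHighlighter
{
    // Wraps every whole-word abbreviation found in input in the example markup,
    // keeping the capitalisation that appears in the original text.
    public static string Highlight(string input, IDictionary<string, string> abbreviations)
    {
        // Case-insensitive lookup so that a match on "Deb" finds the entry stored under "deb".
        var lookup = new Dictionary<string, string>(abbreviations, StringComparer.OrdinalIgnoreCase);

        // Escape each abbreviation and join them into one alternation:
        // \b(?:memb|deb)\b
        string alternation = string.Join("|", lookup.Keys.Select(k => Regex.Escape(k)));
        string pattern = @"\b(?:" + alternation + @")\b";

        // The match evaluator runs once per match and picks the right explanation.
        return Regex.Replace(
            input,
            pattern,
            m =>
            {
                string title;
                // If the lookup unexpectedly fails, leave the matched text untouched.
                return lookup.TryGetValue(m.Value, out title)
                    ? string.Format("[a title=\"{0}\"]{1}[/a]", title, m.Value)
                    : m.Value;
            },
            RegexOptions.IgnoreCase);
    }
}
```

Called with the abbreviations memb = Member and deb = Debut on the text "And another memb, but not amemba. Deb!", this should return "And another [a title="Member"]memb[/a], but not amemba. [a title="Debut"]Deb[/a]!". The content is scanned only once, however many abbreviations are in the list.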

Answered by Tomalak, Dec 04 '22 21:12