Regex to strip line comments from C#

Question

I'm working on a routine to strip block or line comments from some C# code. I have looked at the other examples on the site, but haven't found the exact answer that I'm looking for.

I can match block comments (/* comment */) in their entirety using this regular expression with RegexOptions.Singleline:

(/\*[\w\W]*\*/)

And I can match line comments (// comment) in their entirety using this regular expression with RegexOptions.Multiline:

(//((?!\*/).)*)(?!\*/)[^\r\n]

Note: I'm using [^\r\n] instead of $ because $ also includes \r in the match.

However, this doesn't quite work the way I want it to.

Here is my test code that I'm matching against:

    // remove whole line comments
    bool broken = false; // remove partial line comments
    if (broken == true)
    {
        return "BROKEN";
    }
    /* remove block comments
    else
    {
        return "FIXED";
    } // do not remove nested comments */
    bool working = !broken;
    return "NO COMMENT";

The block expression matches

    /* remove block comments
    else
    {
        return "FIXED";
    } // do not remove nested comments */

which is fine and good, but the line expression matches

    // remove whole line comments
    // remove partial line comments

and

// do not remove nested comments

Also, if I do not have the */ negative lookahead in the line expression twice, it matches

// do not remove nested comments *

which I really don't want.

What I want is an expression that will match characters, starting with //, to the end of line, but does not contain */ between the // and end of line.

Also, just to satisfy my curiosity, can anyone explain why I need the lookahead twice? (//((?!\*/).)*)[^\r\n] and (//(.)*)(?!\*/)[^\r\n] will both include the *, but (//((?!\*/).)*)(?!\*/)[^\r\n] and (//((?!\*/).)*(?!\*/))[^\r\n] won't.

Timwi · Accepted Answer

Both of your regular expressions (for block and line comments) have bugs. If you want I can describe the bugs, but I felt it’s perhaps more productive if I write new ones, especially because I’m intending to write a single one that matches both.

The thing is, every time you have /* and // and literal strings “interfering” with each other, it is always the one that starts first that takes precedence. That’s very convenient because that’s exactly how regular expressions work: find the first match first.

So let’s define a regular expression that matches each of those four tokens:

    var blockComments = @"/\*(.*?)\*/";
    var lineComments = @"//(.*?)\r?\n";
    var strings = @"""((\\[^\n]|[^""\n])*)""";
    var verbatimStrings = @"@(""[^""]*"")+";
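To see the "whichever starts first wins" rule in action, here is a minimal sketch that joins the four patterns into one alternation and lists the tokens it finds. The class name `TokenPrecedenceDemo` and the method `Tokens` are made up for illustration, and the string pattern assumes the escaped-backslash form `\\[^\n]`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class TokenPrecedenceDemo
{
    // The four token patterns, joined into a single alternation.
    const string Pattern =
        @"/\*(.*?)\*/" + "|" +              // block comment
        @"//(.*?)\r?\n" + "|" +             // line comment (consumes the newline)
        @"""((\\[^\n]|[^""\n])*)""" + "|" + // regular string literal
        @"@(""[^""]*"")+";                  // verbatim string literal

    // Returns every token the combined pattern matches, in source order.
    public static List<string> Tokens(string code)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(code, Pattern, RegexOptions.Singleline))
            result.Add(m.Value);
        return result;
    }
}
```

For the input `var s = "// not a comment"; // real comment` followed by a newline, this should yield two tokens: the whole string literal first (its opening quote comes before the `//` inside it), then the trailing line comment including its newline.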

To answer the question in the title (strip comments), we need to:

Replace the block comments with nothing
Replace the line comments with a newline (because the regex eats the newline)
Keep the literal strings where they are.

Regex.Replace can do this easily using a MatchEvaluator function:

    string noComments = Regex.Replace(input,
        blockComments + "|" + lineComments + "|" + strings + "|" + verbatimStrings,
        me => {
            if (me.Value.StartsWith("/*") || me.Value.StartsWith("//"))
                return me.Value.StartsWith("//") ? Environment.NewLine : "";
            // Keep the literal strings
            return me.Value;
        },
        RegexOptions.Singleline);

I ran this code on all the examples that Holystream provided and various other cases that I could think of, and it works like a charm. If you can provide an example where it fails, I am happy to adjust the code for you.
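For anyone who wants to try this quickly, the pieces can be wrapped into one self-contained method. This is only a sketch: the class name `CommentStripper` and the method name `Strip` are invented here, and the string pattern assumes the escaped-backslash form `\\[^\n]`.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the patterns from this answer.
public static class CommentStripper
{
    const string BlockComments = @"/\*(.*?)\*/";
    const string LineComments = @"//(.*?)\r?\n";
    const string Strings = @"""((\\[^\n]|[^""\n])*)""";
    const string VerbatimStrings = @"@(""[^""]*"")+";

    public static string Strip(string input)
    {
        return Regex.Replace(input,
            BlockComments + "|" + LineComments + "|" + Strings + "|" + VerbatimStrings,
            me =>
            {
                // Line comments swallow their newline, so put one back.
                if (me.Value.StartsWith("//", StringComparison.Ordinal))
                    return Environment.NewLine;
                // Block comments disappear entirely.
                if (me.Value.StartsWith("/*", StringComparison.Ordinal))
                    return "";
                // Anything else is a string literal: keep it untouched.
                return me.Value;
            },
            RegexOptions.Singleline);
    }
}
```

One thing to be aware of: the line-comment pattern requires a trailing newline, so a `//` comment on the very last line of an input that does not end in a newline is left in place. Appending a newline to the input before calling `Strip` is one way around that.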

Qtax · Answer

You could tokenize the code with an expression like:

@(?:"[^"]*")+|"(?:[^"\n\\]+|\\.)*"|'(?:[^'\n\\]+|\\.)*'|//.*|/\*(?s:.*?)\*/

It would also match some invalid escapes/structures (e.g. 'foo'), but will probably match all valid tokens of interest (unless I forgot something), thus working well for valid code.

Using it in a replace and capturing the parts you want to keep will give you the desired result. For example:

    static string StripComments(string code)
    {
        var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
        return Regex.Replace(code, re, "$1");
    }

Example app:

    using System;
    using System.Text.RegularExpressions;

    namespace Regex01
    {
        class Program
        {
            static string StripComments(string code)
            {
                // Group 1 captures string and char literals (kept via "$1");
                // comments match outside the group and are replaced with nothing.
                var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
                return Regex.Replace(code, re, "$1");
            }

            static void Main(string[] args)
            {
                var input = "hello /* world */ oh \" '\\\" // ha/*i*/\" and // bai";
                Console.WriteLine(input);

                var noComments = StripComments(input);
                Console.WriteLine(noComments);
            }
        }
    }

Output:

    hello /* world */ oh " '\" // ha/*i*/" and // bai
    hello  oh " '\" // ha/*i*/" and
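As a further check, the same function can be run against a shortened version of the test code from the question. The sketch below is an illustration only: the class name `QtaxStripDemo` and the helper `QuestionSample` are made up, and the sample uses plain `\n` line endings because `//.*` stops at `\n` but would leave a `\r` untouched.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; StripComments is the function from this answer.
public static class QtaxStripDemo
{
    public static string StripComments(string code)
    {
        // Literals are captured in group 1 and kept; comments are dropped.
        var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
        return Regex.Replace(code, re, "$1");
    }

    // A shortened version of the question's test code, with \n line endings.
    public static string QuestionSample()
    {
        return "// remove whole line comments\n" +
               "bool broken = false; // remove partial line comments\n" +
               "/* remove block comments\n" +
               "// do not remove nested comments */\n" +
               "return \"NO COMMENT\";";
    }
}
```

Calling `QtaxStripDemo.StripComments(QtaxStripDemo.QuestionSample())` should leave only `bool broken = false; ` and `return "NO COMMENT";` plus the now-empty lines. The `//` inside the block comment is never seen as a separate line comment, because the block comment starts first and is consumed as one token.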

Tags: c#, .net, regex
