I'm working on a routine to strip block or line comments from some C# code. I have looked at the other examples on the site, but haven't found the exact answer that I'm looking for.
I can match block comments (/* comment */) in their entirety using this regular expression with RegexOptions.Singleline:
(/\*[\w\W]*\*/)
And I can match line comments (// comment) in their entirety using this regular expression with RegexOptions.Multiline:
(//((?!\*/).)*)(?!\*/)[^\r\n]
Note: I'm using [^\r\n] instead of $ because $ is including \r in the match, too.
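The behavior described in that note can be reproduced in isolation. The following is a minimal sketch, assuming CRLF line endings in the input; the class and method names (LineEndingDemo, MatchWithDollar, MatchWithCharClass) are hypothetical and only serve the illustration. In .NET, $ with RegexOptions.Multiline matches just before \n, and . matches \r, so .*$ picks up the carriage return.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LineEndingDemo
{
    // ".*" stops only at '\n', and "$" (Multiline) matches right before '\n',
    // so on CRLF input the trailing '\r' becomes part of the match.
    public static string MatchWithDollar(string text) =>
        Regex.Match(text, @"//.*$", RegexOptions.Multiline).Value;

    // Excluding both '\r' and '\n' ends the match before the line break,
    // whichever line-ending style the input uses.
    public static string MatchWithCharClass(string text) =>
        Regex.Match(text, @"//[^\r\n]*").Value;
}
```

For the input "int x; // note\r\nint y;", the first method returns the comment with a trailing \r, while the second returns it without one.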
However, this doesn't quite work the way I want it to.
Here is my test code that I'm matching against:
// remove whole line comments
bool broken = false; // remove partial line comments
if (broken == true)
{
    return "BROKEN";
}
/* remove block comments
else
{
    return "FIXED";
} // do not remove nested comments */
bool working = !broken;
return "NO COMMENT";
The block expression matches
/* remove block comments
else
{
    return "FIXED";
} // do not remove nested comments */
which is fine and good, but the line expression matches
// remove whole line comments
// remove partial line comments
and
// do not remove nested comments
Also, if I do not have the */ negative lookahead in the line expression twice, it matches
// do not remove nested comments *
which I really don't want.
What I want is an expression that will match characters, starting with //, to the end of line, but does not contain */ between the // and end of line.
Also, just to satisfy my curiosity, can anyone explain why I need the lookahead twice? (//((?!\*/).)*)[^\r\n] and (//(.)*)(?!\*/)[^\r\n] will both include the *, but (//((?!\*/).)*)(?!\*/)[^\r\n] and (//((?!\*/).)*(?!\*/))[^\r\n] won't.
Both of your regular expressions (for block and line comments) have bugs. If you want I can describe the bugs, but I felt it’s perhaps more productive if I write new ones, especially because I’m intending to write a single one that matches both.
The thing is, every time you have /* and // and literal strings “interfering” with each other, it is always the one that starts first that takes precedence. That’s very convenient because that’s exactly how regular expressions work: find the first match first.
So let’s define a regular expression that matches each of those four tokens:
var blockComments = @"/\*(.*?)\*/";
var lineComments = @"//(.*?)\r?\n";
var strings = @"""((\\[^\n]|[^""\n])*)""";
var verbatimStrings = @"@(""[^""]*"")+";
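To answer the question in the title (strip comments), we need to replace the block comments with nothing, replace the line comments with a newline (because the line-comment expression consumes the newline), and keep the literal strings where they are.
Regex.Replace can do this easily using a MatchEvaluator function: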
string noComments = Regex.Replace(input,
    blockComments + "|" + lineComments + "|" + strings + "|" + verbatimStrings,
    me => {
        if (me.Value.StartsWith("/*") || me.Value.StartsWith("//"))
            return me.Value.StartsWith("//") ? Environment.NewLine : "";
        // Keep the literal strings
        return me.Value;
    },
    RegexOptions.Singleline);
I ran this code on all the examples that Holystream provided and various other cases that I could think of, and it works like a charm. If you can provide an example where it fails, I am happy to adjust the code for you.
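To make the approach above easy to try, here is a self-contained sketch that wraps the same four patterns and the same MatchEvaluator logic in one method. The class name CommentStripper and the method name Strip are hypothetical, and the ordinal string comparison is my own addition; the patterns themselves are copied unchanged from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the patterns from the answer, for illustration only.
public static class CommentStripper
{
    private const string BlockComments = @"/\*(.*?)\*/";
    private const string LineComments = @"//(.*?)\r?\n";
    private const string Strings = @"""((\\[^\n]|[^""\n])*)""";
    private const string VerbatimStrings = @"@(""[^""]*"")+";

    public static string Strip(string input)
    {
        return Regex.Replace(
            input,
            BlockComments + "|" + LineComments + "|" + Strings + "|" + VerbatimStrings,
            me =>
            {
                // A line comment match includes its newline, so put one back.
                if (me.Value.StartsWith("//", StringComparison.Ordinal))
                    return Environment.NewLine;
                // A block comment is dropped entirely.
                if (me.Value.StartsWith("/*", StringComparison.Ordinal))
                    return "";
                // Anything else is a string literal: keep it untouched.
                return me.Value;
            },
            RegexOptions.Singleline);
    }
}
```

Calling CommentStripper.Strip on a line such as string s = "// not a comment"; leaves the literal intact, because the string token starts before the // inside it. One limitation worth checking for your input: the line-comment pattern requires a newline after the comment, so a // comment on the very last line with no trailing newline is left in place.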
You could tokenize the code with an expression like:
@(?:"[^"]*")+|"(?:[^"\n\\]+|\\.)*"|'(?:[^'\n\\]+|\\.)*'|//.*|/\*(?s:.*?)\*/
It would also match some invalid escapes/structures (e.g. 'foo'), but will probably match all valid tokens of interest (unless I forgot something), thus working well for valid code.
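To see what the tokenizing expression actually picks out, it can be run through Regex.Matches. This is only a sketch: the class name TokenDemo and the method name Tokens are hypothetical, and the pattern is the one above, written as a C# verbatim string.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class TokenDemo
{
    // Verbatim strings, regular strings, char literals, line comments, block comments.
    private const string TokenPattern =
        @"@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*'|//.*|/\*(?s:.*?)\*/";

    // Returns every token the pattern matches, in source order.
    public static List<string> Tokens(string code)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(code, TokenPattern))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

For the input hello /* world */ oh " '\" // ha/*i*/" and // bai, this yields three tokens: the block comment, the whole string literal (including the comment-like text inside it), and the trailing line comment.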
Using it in a replace, and capturing only the parts you want to keep, will give you the desired result. For example:
static string StripComments(string code)
{
    var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
    return Regex.Replace(code, re, "$1");
}
Example app:
using System;
using System.Text.RegularExpressions;

namespace Regex01
{
    class Program
    {
        static string StripComments(string code)
        {
            var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
            return Regex.Replace(code, re, "$1");
        }

        static void Main(string[] args)
        {
            var input = "hello /* world */ oh \" '\\\" // ha/*i*/\" and // bai";
            Console.WriteLine(input);
            var noComments = StripComments(input);
            Console.WriteLine(noComments);
        }
    }
}
Output:
hello /* world */ oh " '\" // ha/*i*/" and // bai
hello  oh " '\" // ha/*i*/" and 