C# How to delete XML/HTML comments with regular expression

Tags:

c#

regex

The fragment below doesn't work for me.

fragment = Regex.Replace(fragment, "<!--.*?-->", String.Empty , RegexOptions.Multiline  );
MicMit asked Aug 20 '09 05:08


2 Answers

Change it to RegexOptions.Singleline instead and it'll work just fine. When not in Singleline mode, the dot matches any character except newline, so the pattern cannot match a comment that spans more than one line.

Note that Singleline and Multiline are not mutually exclusive. They do two separate things. To quote MSDN:

Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.

Single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).
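As a minimal sketch of that fix, the snippet below applies the question's pattern with RegexOptions.Singleline to a comment that spans two lines. The class name CommentStripper, the method name Strip, and the sample input are made up for illustration; only Regex.Replace and the option come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripper
{
    // Singleline lets '.' match '\n', so the lazy ".*?" can run across
    // line breaks until it reaches the first "-->".
    public static string Strip(string fragment)
    {
        return Regex.Replace(fragment, "<!--.*?-->", string.Empty, RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Hypothetical input: the comment contains a newline.
        string fragment = "<p>keep</p><!-- one\ntwo --><p>also keep</p>";
        Console.WriteLine(Strip(fragment)); // prints "<p>keep</p><p>also keep</p>"
    }
}
```

With RegexOptions.Multiline alone, the same call would return this input unchanged, because the dot stops at the newline inside the comment. The two options can also be combined with | if the pattern additionally needs per-line ^ and $.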

Other people have already suggested the HTML Agility Pack. I just felt you should have an explanation on why your Regex wouldn't work :)

Thorarin answered Sep 18 '22 16:09


Please don't use regular expressions to work with markup languages - you need to use a better tool that is built for that kind of job.

Use the Html Agility Pack instead. I even found this article in which a reader (named Simon Mourier) comments with a function that uses the Html Agility Pack to remove comments from a document:

Simon Mourier said:

This is a sample code to remove comments:

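using System;
using HtmlAgilityPack; // from the Html Agility Pack library

class Program
{
    static void Main(string[] args)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load("filewithcomments.htm");
        doc.Save(Console.Out); // show before
        RemoveComments(doc.DocumentNode);
        doc.Save(Console.Out); // show after
    }

    // Recursively removes every comment node below the given node.
    static void RemoveComments(HtmlNode node)
    {
        if (!node.HasChildNodes)
        {
            return;
        }

        // Remove the comment nodes that are direct children of this node.
        for (int i = 0; i < node.ChildNodes.Count; i++)
        {
            if (node.ChildNodes[i].NodeType == HtmlNodeType.Comment)
            {
                node.ChildNodes.RemoveAt(i);
                --i; // the following nodes shifted down, so re-check this index
            }
        }

        // Then descend into the children that remain.
        foreach (HtmlNode subNode in node.ChildNodes)
        {
            RemoveComments(subNode);
        }
    }
}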
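For the XML half of the question, the same parser-based idea can be sketched with LINQ to XML from the .NET base class library. This is not part of the quoted answer and it assumes the input is well-formed XML (it will throw on typical real-world HTML); the class name XmlCommentRemover and the sample document are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlCommentRemover
{
    // Parses the XML, removes every comment node, and returns the result.
    // Assumption: the input is well-formed XML; XDocument.Parse throws otherwise.
    public static string RemoveComments(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // DescendantNodes() walks the whole tree; OfType<XComment>() keeps
        // only comments; the Remove() extension detaches them all.
        doc.DescendantNodes().OfType<XComment>().Remove();

        // DisableFormatting keeps the serializer from adding indentation.
        return doc.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        // Hypothetical input with a comment that spans two lines.
        string xml = "<root><!-- first\nsecond --><a>1</a></root>";
        Console.WriteLine(RemoveComments(xml));
    }
}
```

Because the parser decides what is a comment, text that merely looks like one (inside a CDATA section, for example) is left alone, which is the main advantage over the regular expression.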
Andrew Hare answered Sep 21 '22 16:09