Find all 'less than' or 'greater than' characters that are not part of tags in XML

Question

I need a regex to find all '<' or '>' characters that are not part of XML tags.

Example:

<tag1>W<E><E</tag1>Z<>S

Should find

<><<>

Example:

<tag1>W<E><E</E></tag1>Z<>S

Should find

<<>

So, any hits where '<' or '>' is not in a tag (yes, we also have self-closing tags, which should be taken into consideration :)

Edit #2: What I want to do in the end is to replace all matches with the html-encoded values.

Edit #3:

So what I want to do is take a text containing HTML with some additional tags (very few known tags) and get every '<' and '>' that is not part of a tag.

Example (the bold ones I want to find so I can replace them with their encoded values):

<div>
  <a href="link">Link with < characters</a>
  <knownTag>Text with character ></knownTag>
  <knownTag>Text < again ></knownTag>
<div>

Result should be:

<div>
  <a href="link">Link with &lt; characters</a>
  <knownTag>Text with character &gt;</knownTag>
  <knownTag>Text &lt; again &gt;</knownTag>
<div>

Any idea on how to solve this problem?

csharptest.net · Accepted Answer

This can be done with regex; however, it's not as simple as you suggest. You will need to find valid tags and process them in order to make this work. It just so happens that I did this some time ago when writing a fast and lightweight xml/html parser. The code is available at:

http://csharptest.net/browse/src/Library/Html/XmlLightParser.cs
http://csharptest.net/browse/src/Library/Html/XmlLightInterfaces.cs

To use the parser, you implement the interface IXmlLightReader, which is defined in the latter of the two source files. The following example produces your desired results, and also handles several other cases you did not mention, like CDATA sections, processing instructions, DTDs, etc.

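using System;
using System.IO;
using System.Web;            // HttpUtility
using CSharpTest.Net.Html;   // XmlLightParser, IXmlLightReader (namespace assumed; adjust to match your copy of the linked sources)

class RegexForBadXml
{
    // Verbatim string: a regular "..." literal cannot span lines, so quotes are doubled instead of escaped.
    const string Input = @"<?xml version=""1.0""?>
<div>
	<a href=""link"">Link with < characters</a>
	<knownTag>Text with character > &and other &#BAD; stuff</knownTag>
	<knownTag>Text < again ></knownTag>
	<knownTag><![CDATA[ Text < again > ]]></knownTag>
<div>";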

    private static void Main()
    {
        var output = new StringWriter();
        XmlLightParser.Parse(Input, XmlLightParser.AttributeFormat.Html, new OutputFormatter(output));
        Console.WriteLine(output.ToString());
    }

    private class OutputFormatter : IXmlLightReader
    {
        private readonly TextWriter _output;
        public OutputFormatter(TextWriter output)
        {
            _output = output;
        }

        void IXmlLightReader.StartDocument() { }
        void IXmlLightReader.EndDocument() { }

        public void StartTag(XmlTagInfo tag)
        {
            _output.Write(tag.UnparsedTag);
        }

        public void EndTag(XmlTagInfo tag)
        {
            _output.Write(tag.UnparsedTag);
        }

        public void AddText(string content)
        {
            _output.Write(HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(content)));
        }

        public void AddComment(string comment)
        {
            _output.Write(comment);
        }

        public void AddCData(string cdata)
        {
            _output.Write(cdata);
        }

        public void AddControl(string cdata)
        {
            _output.Write(cdata);
        }

        public void AddInstruction(string instruction)
        {
            _output.Write(instruction);
        }
    }
}

The preceding program outputs the following results:

<?xml version="1.0"?>
<div>
    <a href="link">Link with &lt; characters</a>
    <knownTag>Text with character &gt; &amp;and other &amp;BAD; stuff</knownTag>
    <knownTag>Text &lt; again &gt;</knownTag>
    <knownTag><![CDATA[ Text < again > ]]></knownTag>
<div>

Note: I added the xml declaration, CDATA, and '&' text for testing only.
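If pulling in a parser is not an option and the set of tags really is small and known (as Edit #3 in the question says), a single-pass regex with a match evaluator may be enough. The following is only a sketch under that assumption, not part of the parser above: the class name KnownTagEncoder and the tag pattern are made up for illustration. It passes comments, CDATA sections, and processing instructions through untouched, keeps tag-shaped tokens whose name is in the known set, and encodes every other '<' or '>'. It does not check that tags are balanced and does not handle DOCTYPE declarations.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class KnownTagEncoder
{
    // Alternatives are tried in order at each position:
    //   1. comments, CDATA sections, processing instructions (kept as-is)
    //   2. a tag-shaped token: <name attr="v" ...>, </name>, <name ... />
    //   3. any single '<' or '>' that was not consumed by 1 or 2
    private static readonly Regex Token = new Regex(
        @"(?<keep><!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>)" +
        @"|</?(?<name>[A-Za-z][\w:.-]*)" +
        @"(?:\s+[\w:.-]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[^\s""'<>]+))?)*\s*/?>" +
        @"|[<>]",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // knownTags is the caller's whitelist (hypothetical; pass a
    // case-insensitive set if tag names should match regardless of case).
    public static string Encode(string input, ISet<string> knownTags)
    {
        return Token.Replace(input, m =>
        {
            if (m.Groups["keep"].Success)
                return m.Value;
            if (m.Groups["name"].Success && knownTags.Contains(m.Groups["name"].Value))
                return m.Value;
            // Stray bracket, or a tag-shaped token with an unknown name.
            return m.Value.Replace("<", "&lt;").Replace(">", "&gt;");
        });
    }
}
```

Called with the known set { "div", "a", "knownTag" }, this leaves the tags in the question's Edit #3 example alone and encodes only the brackets inside the text. Unlike the parser-based answer, it does not touch '&'.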

Damith · Answer

Use one of the methods from this question to remove the HTML tags from the input, then collect the angle brackets that remain:

// requires: using System.Linq;
string output = new string(input.Where(c => c == '<' || c == '>').ToArray());

Michael Kay · Answer

Judging from your example, it seems you are not searching XML files as the subject suggests, but rather XML-like files - perhaps files that would be XML if they did not contain the "<" and ">" characters that you are looking for.

But you have not specified the task clearly enough. What should happen, for example, with

<tag1>xxxx</tag2>

or with

<tag1><x a="</tag1>"/></tag1>

Picking up the second case is pretty tough (perhaps impossible) to achieve with regular expressions alone. You need to define the grammar of the input language you want to accept (an extension of XML) and parse it using recursive parsing techniques.
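As one possible way to pin the rules down, here is a minimal sketch that assumes a specific answer to the questions above: a tag-shaped token counts as a tag only if it is self-closing or pairs up with a later closing token of the same name; everything else has its brackets encoded. This is an assumption made for illustration, not a definition taken from the question, and the class name BalancedTagEncoder is made up. A regex only finds the candidates (quoted attribute values are consumed whole, which covers the second case); a stack does the pairing. Comments and CDATA sections are left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class BalancedTagEncoder
{
    // Tag-shaped candidates: <name ...>, </name>, <name ... />
    private static readonly Regex Candidate = new Regex(
        @"<(?<close>/)?(?<name>[A-Za-z][\w:.-]*)" +
        @"(?:\s+[\w:.-]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[^\s""'<>]+))?)*" +
        @"\s*(?<self>/)?>",
        RegexOptions.Compiled);

    public static string Encode(string input)
    {
        MatchCollection candidates = Candidate.Matches(input);
        var isTag = new bool[candidates.Count];
        var open = new Stack<int>(); // indexes of opening candidates not yet paired

        for (int i = 0; i < candidates.Count; i++)
        {
            Match m = candidates[i];
            bool closing = m.Groups["close"].Success;
            if (m.Groups["self"].Success && !closing) { isTag[i] = true; continue; }
            if (!closing) { open.Push(i); continue; }

            // Closing token: look for the nearest unpaired opener with the same name.
            string name = m.Groups["name"].Value;
            bool hasOpener = false;
            foreach (int j in open) // Stack<T> enumerates from the top down
                if (candidates[j].Groups["name"].Value == name) { hasOpener = true; break; }
            if (!hasOpener) continue; // unmatched close stays a stray token

            // Pop up to the opener; anything skipped on the way stays unpaired.
            while (true)
            {
                int j = open.Pop();
                if (candidates[j].Groups["name"].Value == name)
                {
                    isTag[j] = true;
                    isTag[i] = true;
                    break;
                }
            }
        }

        // Copy accepted tags verbatim; encode brackets everywhere else.
        var sb = new StringBuilder(input.Length + 16);
        int pos = 0;
        for (int i = 0; i < candidates.Count; i++)
        {
            if (!isTag[i]) continue;
            AppendEncoded(sb, input, pos, candidates[i].Index);
            sb.Append(candidates[i].Value);
            pos = candidates[i].Index + candidates[i].Length;
        }
        AppendEncoded(sb, input, pos, input.Length);
        return sb.ToString();
    }

    private static void AppendEncoded(StringBuilder sb, string s, int start, int end)
    {
        for (int i = start; i < end; i++)
        {
            char c = s[i];
            if (c == '<') sb.Append("&lt;");
            else if (c == '>') sb.Append("&gt;");
            else sb.Append(c);
        }
    }
}
```

Under these rules the question's two short examples come out as asked: in the first, <E> has no partner and is encoded; in the second, <E> pairs with </E> and is kept. The mismatched <tag1>xxxx</tag2> has both tokens encoded. The trade-off is that a legitimately unclosed tag, such as the trailing <div> in the question's last example, is encoded too, so real input would likely need this combined with a list of known tags.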

Tags:

c#

regex

xml

Carl-Otto Kjellkvist
