
Find all 'less than' or 'greater than' characters that are not part of tags in XML

Tags: c#, regex, xml

I need a regex to find all '<' or '>' characters that are not part of XML tags.

Example:

<tag1>W<E><E</tag1>Z<>S

Should find

<><<>

Example:

<tag1>W<E><E</E></tag1>Z<>S

Should find

<<>

So, any hits where '<' or '>' is not in a tag (yes, we also have self-closing tags, which should be taken into consideration :)

Edit #2: What I want to do in the end is to replace all matches with the html-encoded values.

Edit #3:

So what I want to do is this: from a text containing HTML with some additional tags (very few known tags), get all '<' and '>' characters that are not part of the tags.

Example (I want to find the '<' and '>' characters in the text content so I can replace them with their encoded values):

<div>
  <a href="link">Link with < characters</a>
  <knownTag>Text with character ></knownTag>
  <knownTag>Text < again ></knownTag>
<div>

Result should be:

<div>
  <a href="link">Link with &lt; characters</a>
  <knownTag>Text with character &gt;</knownTag>
  <knownTag>Text &lt; again &gt;</knownTag>
<div>

Any idea on how to solve this problem?

Carl-Otto Kjellkvist asked Jun 09 '13 17:06


3 Answers

This can be done with regex; however, it's not as simple as you suggest. You will need to find valid tags and process them in order to make this work. It just so happens that I did this some time ago when writing a fast and lightweight xml/html parser. The code is available at:

http://csharptest.net/browse/src/Library/Html/XmlLightParser.cs
http://csharptest.net/browse/src/Library/Html/XmlLightInterfaces.cs

To use the parser, you implement the interface IXmlLightReader defined in the latter of the two source files. The following example produces your desired results, and also handles several other cases you did not mention, like CDATA sections, processing instructions, DTDs, etc.

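using System;
using System.IO;
using System.Web;           // HttpUtility
using CSharpTest.Net.Html;  // assumed namespace of XmlLightParser, IXmlLightReader and XmlTagInfo in the linked files

class RegexForBadXml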
{
    const string Input = "<?xml version=\"1.0\"?>\r\n<div>\r\n\t<a href=\"link\">Link with < characters</a>\r\n\t<knownTag>Text with character > &and other &#BAD; stuff</knownTag>\r\n\t<knownTag>Text < again ></knownTag>\r\n\t<knownTag><![CDATA[ Text < again > ]]></knownTag>\r\n<div>";

    private static void Main()
    {
        var output = new StringWriter();
        XmlLightParser.Parse(Input, XmlLightParser.AttributeFormat.Html, new OutputFormatter(output));
        Console.WriteLine(output.ToString());
    }

    private class OutputFormatter : IXmlLightReader
    {
        private readonly TextWriter _output;
        public OutputFormatter(TextWriter output)
        {
            _output = output;
        }

        void IXmlLightReader.StartDocument() { }
        void IXmlLightReader.EndDocument() { }

        public void StartTag(XmlTagInfo tag)
        {
            _output.Write(tag.UnparsedTag);
        }

        public void EndTag(XmlTagInfo tag)
        {
            _output.Write(tag.UnparsedTag);
        }

        public void AddText(string content)
        {
            _output.Write(HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(content)));
        }

        public void AddComment(string comment)
        {
            _output.Write(comment);
        }

        public void AddCData(string cdata)
        {
            _output.Write(cdata);
        }

        public void AddControl(string cdata)
        {
            _output.Write(cdata);
        }

        public void AddInstruction(string instruction)
        {
            _output.Write(instruction);
        }
    }
}

The preceding program outputs the following results:

<?xml version="1.0"?>
<div>
    <a href="link">Link with &lt; characters</a>
    <knownTag>Text with character &gt; &amp;and other &amp;BAD; stuff</knownTag>
    <knownTag>Text &lt; again &gt;</knownTag>
    <knownTag><![CDATA[ Text < again > ]]></knownTag>
<div>

Note: I added the xml declaration, CDATA, and '&' text for testing only.
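
The AddText method above decodes the text before encoding it, presumably so that entities that are already encoded are not encoded a second time. A minimal sketch of that idea on its own, assuming System.Net.WebUtility as a stand-in for HttpUtility (the class name EncodeDemo and the method Normalize are made up for illustration):

```csharp
using System;
using System.Net;

static class EncodeDemo
{
    // Hypothetical helper: decode first, then encode, so text that is
    // already encoded ("&lt;") and raw text ("<") end up in the same form.
    public static string Normalize(string text)
    {
        return WebUtility.HtmlEncode(WebUtility.HtmlDecode(text));
    }

    static void Main()
    {
        Console.WriteLine(Normalize("a < b"));     // a &lt; b
        Console.WriteLine(Normalize("a &lt; b"));  // a &lt; b (not a &amp;lt; b)
    }
}
```

Encoding without the decode step would turn an existing "&lt;" into "&amp;lt;".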

csharptest.net answered Oct 11 '22 10:10


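Use one of the methods from this question to remove the HTML tags from the input, then collect the remaining '<' and '>' characters:

// requires "using System.Linq;"; input is the text after the tags have been removed
string output = new string(input.Where(c => c == '<' || c == '>').ToArray());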
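
A rough sketch of both steps together, assuming the set of known tag names is available and that tags carry no attributes. The regex, the class name StripThenCollect and the method StrayBrackets are illustrative assumptions, not taken from the linked question:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class StripThenCollect
{
    // Step 1: remove opening, closing and self-closing forms of the known tags.
    // Step 2: keep only the '<' and '>' characters that are left over.
    public static string StrayBrackets(string input, params string[] knownTags)
    {
        string names = string.Join("|", knownTags.Select(Regex.Escape));
        string withoutTags = Regex.Replace(input, @"</?(?:" + names + @")\s*/?>", "");
        return new string(withoutTags.Where(c => c == '<' || c == '>').ToArray());
    }

    static void Main()
    {
        // The two examples from the question.
        Console.WriteLine(StrayBrackets("<tag1>W<E><E</tag1>Z<>S", "tag1"));           // <><<>
        Console.WriteLine(StrayBrackets("<tag1>W<E><E</E></tag1>Z<>S", "tag1", "E"));  // <<>
    }
}
```

Note that this only reports which characters are stray; it does not tell you where they were, so it cannot be used directly to replace them in the original text.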
Damith answered Oct 11 '22 11:10


Judging from your example, it seems you are not searching XML files as the subject suggests, but rather XML-like files - perhaps files that would be XML if they did not contain the "<" and ">" characters that you are looking for.

But you have not specified the task clearly enough. What should happen, for example, with

<tag1>xxxx</tag2>

or with

<tag1><x a="</tag1>"/></tag1>

Picking up the second case is pretty tough (perhaps impossible) to achieve with regular expressions alone. You need to define the grammar of the input language you want to accept (an extension of XML) and parse it using recursive parsing techniques.
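
If the input language is narrowed to "a short list of known tag names, with optional quoted attributes, and everything else is text", a single left-to-right tokenizing pass may be enough. The following is only a sketch under that assumption, not a general XML solution; the class name StrayBracketEncoder, the method Encode and the pattern are made up for illustration, and comments, CDATA sections and unquoted attribute values are not handled:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class StrayBracketEncoder
{
    // At each position, try to match a complete known tag first; a quoted
    // attribute value may contain '<' or '>'. Any '<' or '>' that does not
    // start or belong to such a tag is matched alone and encoded.
    public static string Encode(string input, params string[] knownTags)
    {
        string names = string.Join("|", knownTags.Select(Regex.Escape));
        string pattern =
            @"(?<tag></?(?:" + names + @")" +
            @"(?:\s+[\w:-]+(?:\s*=\s*(?:""[^""]*""|'[^']*'))?)*" +
            @"\s*/?>)" +
            @"|<|>";
        return Regex.Replace(input, pattern, m =>
        {
            if (m.Groups["tag"].Success) return m.Value;  // keep known tags untouched
            return m.Value == "<" ? "&lt;" : "&gt;";      // encode stray brackets
        });
    }

    static void Main()
    {
        Console.WriteLine(Encode("<knownTag>Text < again ></knownTag>", "knownTag"));
        // <knownTag>Text &lt; again &gt;</knownTag>
        Console.WriteLine(Encode("<tag1><x a=\"</tag1>\"/></tag1>", "tag1", "x"));
        // unchanged: the "</tag1>" inside the attribute value is part of the <x .../> tag
    }
}
```

Under these assumptions a mismatched pair such as <tag1>xxxx</tag2> is passed through as two known tags (if both names are listed); deciding whether that is an error is exactly the kind of question the grammar has to answer.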

Michael Kay answered Oct 11 '22 11:10