Conditionally escape special XML characters

I have looked around a lot but have not been able to find a built-in .NET method that escapes the special XML characters (<, >, &, ' and ") only when they are not part of a tag.

For example, take the following text:

Test& <b>bold</b> <i>italic</i> <<Tag index="0" />

I want it to be converted to:

Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" />

Notice that the tags are not escaped. I need to assign this value to the InnerXml property of an XmlElement, so those tags must be preserved.
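To illustrate the constraint, here is a minimal sketch of what happens when each version of the text is assigned to InnerXml. The helper name TryAssign and the element name Message are hypothetical and used only for this example.

```csharp
using System;
using System.Xml;

static class InnerXmlDemo
{
    // Returns true when the fragment can be assigned to InnerXml
    // without the parser throwing an XmlException.
    public static bool TryAssign(string fragment)
    {
        var doc = new XmlDocument();
        XmlElement element = doc.CreateElement("Message"); // hypothetical element name
        try
        {
            element.InnerXml = fragment; // parsed as an XML fragment
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With the raw text the call returns false, because the bare & and the stray < are not well-formed; with the converted text it returns true and the b, i and Tag elements become child nodes.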

I have looked into implementing my own parser, using a StringBuilder to optimize it as much as I can, but it gets pretty nasty.

I also know which tags are acceptable, which may simplify things (only br, b, i, u, blink, flash and Tag). These tags can be self-closing (e.g. <u />) or container tags (e.g. <u>...</u>).
Amir asked Dec 19 '12 22:12




3 Answers

NOTE: This could probably be optimised; it is just something I knocked up quickly. I am not validating the tags themselves: the code only looks for content wrapped in angle brackets, so it will fail if an angle bracket appears inside a tag (e.g. <sometag label="I put an > here">). Other than that, I think it should do what you're asking for.

namespace ConsoleApplication1
{
    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            // This is the test string.
            const string testString = "Test& <b>bold</b> <i>italic</i> <<Tag index=\"0\" />";

            // Do a regular expression search and replace. We're looking for a complete tag (which will be ignored) or
            // a character that needs escaping.
            string result = Regex.Replace(testString, @"(?'Tag'\<{1}[^\>\<]*[\>]{1})|(?'Ampy'\&[A-Za-z0-9]+;)|(?'Special'[\<\>\""\'\&])", (match) =>
                {
                    // If a special (escapable) character was found, replace it.
                    if (match.Groups["Special"].Success)
                    {
                        switch (match.Groups["Special"].Value)
                        {
                            case "<":
                                return "&lt;";
                            case ">":
                                return "&gt;";
                            case "\"":
                                return "&quot;";
                            case "\'":
                                return "&apos;";
                            case "&":
                                return "&amp;";
                            default:
                                return match.Groups["Special"].Value;
                        }
                    }

                    // Otherwise, just return what was found.
                    return match.Value;
                });

            // Show the result.
            Console.WriteLine("Test String: " + testString);
            Console.WriteLine("Result     : " + result);
            Console.ReadKey();
        }
    }
}
Nigel Whatling answered Oct 06 '22 00:10



I personally don't think it is possible, because you are really trying to fix malformed HTML, and therefore there are no rules which you can use to determine what is to be encoded and what isn't.

Any which way you look at it, something like <<Tag index="0" /> is not valid HTML.

If you know the actual tags, you may be able to create a whitelist, which could simplify things. But you will have to attack your problem more specifically; I do not think you will be able to solve this for every scenario.
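A minimal sketch of that whitelist idea, assuming the tag names listed in the question. The class name WhitelistEscaper and the exact pattern are illustrative, not a tested general solution: a > inside an attribute value will still confuse it, and tag names are matched case-sensitively.

```csharp
using System;
using System.Text.RegularExpressions;

static class WhitelistEscaper
{
    // Tag names taken from the question; everything else here is an assumption.
    private const string AllowedNames = "br|b|i|u|blink|flash|Tag";

    private static readonly Regex Pattern = new Regex(
        // An opening, closing or self-closing whitelisted tag, with optional attributes.
        @"(?<tag></?(?:" + AllowedNames + @")(?:\s[^<>]*)?/?>)" +
        // An existing entity or character reference, left untouched.
        @"|(?<entity>&(?:[A-Za-z]+|#[0-9]+|#x[0-9A-Fa-f]+);)" +
        // Any remaining special character, which gets escaped.
        @"|(?<special>[<>&""'])");

    public static string Escape(string input)
    {
        return Pattern.Replace(input, m =>
        {
            // Whitelisted tags and existing entities pass through unchanged.
            if (!m.Groups["special"].Success)
                return m.Value;

            switch (m.Value)
            {
                case "<": return "&lt;";
                case ">": return "&gt;";
                case "&": return "&amp;";
                case "\"": return "&quot;";
                default: return "&apos;";
            }
        });
    }
}
```

For the text in the question, Escape returns Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" />, while a tag outside the list such as <script> is escaped. The output is still not guaranteed to be well-formed XML: an unclosed <b>, for example, passes through as-is.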

In fact, chances are you haven't actually got any random < or > lying around in your text, and that would (probably) greatly simplify the problem. But if you are really trying to come up with a generic solution, I wish you luck.

Ian G answered Oct 06 '22 00:10



Here's a regular expression you can use that will match any invalid < or >.

(\<(?! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))|(?<! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))\>)

I suggest putting the valid tag-test expression into a variable and then constructing the rest around it.

var validTags = "b|i|br|u|blink|flash|Tag[^>]*";
var startTag = @"\<(?! ?/?(?:" + validTags + "))";
var endTag = @"(?<! ?/?(?:" + validTags + @"))\>";

Then just do Regex.Replace on these.
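A rough sketch of how the two patterns might be applied, with the end-tag pattern ending in \> to match the combined expression. The class name InvalidBracketEscaper and the extra ampersand step are assumptions added for illustration; the expressions above only deal with < and >.

```csharp
using System;
using System.Text.RegularExpressions;

static class InvalidBracketEscaper
{
    private const string ValidTags = "b|i|br|u|blink|flash|Tag[^>]*";

    // "<" that is not followed by an allowed tag name.
    private static readonly Regex StartTag =
        new Regex(@"\<(?! ?/?(?:" + ValidTags + "))");

    // ">" that is not preceded by an allowed tag name.
    private static readonly Regex EndTag =
        new Regex(@"(?<! ?/?(?:" + ValidTags + @"))\>");

    // Assumption, not part of the answer: "&" that does not already start an entity.
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:[A-Za-z]+|#[0-9]+|#x[0-9A-Fa-f]+);)");

    public static string Escape(string input)
    {
        // Ampersands go first so the entities inserted below are not escaped again.
        string result = BareAmpersand.Replace(input, "&amp;");
        result = StartTag.Replace(result, "&lt;");
        return EndTag.Replace(result, "&gt;");
    }
}
```

On the sample text from the question this produces Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" />. Note that the tag names are matched as prefixes, so something like <bold> is treated as valid, and the > of a self-closing tag other than Tag (for example <br />) appears to be treated as invalid, since the lookbehind expects the tag name immediately before the >.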

Bobson answered Oct 05 '22 23:10
