I have looked around a lot but have not been able to find a built-in .NET method that escapes the special XML characters (<, >, &, ' and ") only when they are not part of a tag.
For example, take the following text:
Test& <b>bold</b> <i>italic</i> <<Tag index="0" />
I want it to be converted to:
Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" />
Notice that the tags are not escaped. I basically need to assign this value to the InnerXml of an XmlElement, so those tags must be preserved.
I have looked into implementing my own parser, using a StringBuilder to optimize it as much as I can, but it gets pretty nasty.
I also know which tags are acceptable, which may simplify things (only: br, b, i, u, blink, flash, Tag). These tags can be self-closing (e.g. <u />) or container tags (e.g. <u>...</u>).
XML escape characters
There are only five:
&quot; for "
&apos; for '
&lt; for <
&gt; for >
&amp; for &
How a character must be escaped depends on where it is used. The examples can be validated at the W3C Markup Validation Service.
The special characters can be referenced in XML using one of three formats:
&name; where name is the character name (if available), such as quot, amp, apos, lt, or gt.
&#nn; where nn is the decimal character code.
&#xhh; where hh is the hexadecimal character code.
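To make the three reference formats concrete, here is a minimal sketch that parses the same character written in each form and shows that they decode identically. The class name ReferenceForms and the wrapper element name x are made up for illustration.

```csharp
using System;
using System.Xml.Linq;

static class ReferenceForms
{
    // Wraps the text in a throwaway element (hypothetical name "x") and lets
    // the XML parser resolve any named, decimal, or hexadecimal references.
    public static string Decode(string xmlText)
    {
        return XElement.Parse("<x>" + xmlText + "</x>").Value;
    }
}
```

For example, ReferenceForms.Decode("&amp; &#38; &#x26;") yields three ampersands separated by spaces, since all three forms refer to the same character.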
NOTE: This could probably be optimised; it was just something I knocked up quickly for you. Also note that I am not validating the tags themselves. The code just looks for content wrapped in angle brackets. It will also fail if an angle bracket appears inside a tag (e.g. <sometag label="I put an > here">). Other than that, I think it should do what you're asking for.
namespace ConsoleApplication1
{
    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            // This is the test string.
            const string testString = "Test& <b>bold</b> <i>italic</i> <<Tag index=\"0\" />";

            // Do a regular expression search and replace. We're looking for a complete tag (which will be
            // ignored), an existing named entity such as &amp; (also ignored), or a character that needs escaping.
            string result = Regex.Replace(testString, @"(?'Tag'\<{1}[^\>\<]*[\>]{1})|(?'Ampy'\&[A-Za-z0-9]+;)|(?'Special'[\<\>\""\'\&])", (match) =>
            {
                // If a special (escapable) character was found, replace it with its entity.
                if (match.Groups["Special"].Success)
                {
                    switch (match.Groups["Special"].Value)
                    {
                        case "<":
                            return "&lt;";
                        case ">":
                            return "&gt;";
                        case "\"":
                            return "&quot;";
                        case "\'":
                            return "&apos;";
                        case "&":
                            return "&amp;";
                        default:
                            return match.Groups["Special"].Value;
                    }
                }

                // Otherwise (a tag or an existing entity), just return what was found.
                return match.Value;
            });

            // Show the result.
            Console.WriteLine("Test String: " + testString);
            Console.WriteLine("Result     : " + result);
            Console.ReadKey();
        }
    }
}
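Since the goal in the question is to assign the escaped text to InnerXml, a quick way to check any escaping routine is to feed its output to an XmlElement and see whether the parser accepts it. The sketch below assumes a throwaway document with a root element named root; both the class name InnerXmlDemo and that element name are made up for illustration.

```csharp
using System;
using System.Xml;

static class InnerXmlDemo
{
    // Builds a throwaway element and assigns the fragment to InnerXml.
    // The setter parses the fragment, so text that is not well-formed
    // (for example a bare & or a stray <) raises an XmlException.
    public static XmlElement Build(string fragment)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("root"); // hypothetical element name
        doc.AppendChild(root);
        root.InnerXml = fragment;
        return root;
    }
}
```

Passing the escaped text Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" /> should succeed and produce b, i and Tag child elements, whereas passing the original unescaped text should throw.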
I personally don't think it is possible, because you are really trying to fix malformed HTML, and therefore there are no rules you can use to determine what should be encoded and what shouldn't.
Whichever way you look at it, something like <<Tag index="0" /> is not valid HTML.
If you know the actual tags, you may be able to create a whitelist, which could simplify things, but you will have to attack your problem more specifically; I do not think you will be able to solve this for every scenario.
In fact, chances are you haven't actually got any random < or > lying around in your text, and that would (probably) greatly simplify the problem. But if you are really trying to come up with a generic solution... I wish you luck.
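The whitelist idea can be sketched as a single tokenizing regular expression: treat only the tags named in the question as markup, leave existing character references alone, and escape everything else. This is an illustrative sketch under those assumptions, not a general solution. The class name AllowListEscaper is made up, the tag list is copied from the question, and an attribute value containing < or > would still defeat it.

```csharp
using System;
using System.Text.RegularExpressions;

static class AllowListEscaper
{
    // Alternatives are tried left to right at each position:
    //   tag     - an opening, closing, or self-closing tag from the allowed list
    //   entity  - a numeric reference or one of the five predefined XML entities
    //   special - any remaining character that must be escaped
    private static readonly Regex Token = new Regex(
        @"(?<tag></?(?:br|b|i|u|blink|flash|Tag)(?:\s[^<>]*)?/?>)" +
        @"|(?<entity>&(?:#[0-9]+|#x[0-9A-Fa-f]+|lt|gt|amp|quot|apos);)" +
        @"|(?<special>[<>&""'])");

    public static string Escape(string text)
    {
        return Token.Replace(text, m =>
        {
            // Allowed tags and existing references pass through unchanged.
            if (!m.Groups["special"].Success) return m.Value;
            switch (m.Value)
            {
                case "<": return "&lt;";
                case ">": return "&gt;";
                case "&": return "&amp;";
                case "\"": return "&quot;";
                default: return "&apos;"; // the only remaining special character is '
            }
        });
    }
}
```

With the question's input, AllowListEscaper.Escape returns Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" />. A tag that is not on the list, such as <blah>, comes out as &lt;blah&gt;. Restricting the entity branch to the five predefined names is a design choice: something like &nbsp; is not defined in plain XML, so its ampersand gets escaped too.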
Here's a regular expression you can use that will match any invalid < or >.
(\<(?! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))|(?<! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))\>)
I suggest putting the valid tag-test expression into a variable and then constructing the rest around it.
var validTags = "b|i|br|u|blink|flash|Tag[^>]*";
var startTag = @"\<(?! ?/?(?:" + validTags + "))";
var endTag = @"(?<! ?/?(?:" + validTags + @"))\>";
Then just call Regex.Replace with each of these.
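Put together, the replacement step might look like the sketch below. It reuses the two lookaround patterns above and adds a third pattern for bare ampersands, which those two do not cover. The class name LookaroundEscaper and the BareAmp pattern are assumptions added for illustration, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaroundEscaper
{
    private const string ValidTags = "b|i|br|u|blink|flash|Tag[^>]*";

    // Assumed addition: an & that does not start a numeric reference or
    // one of the five predefined entities.
    private static readonly Regex BareAmp =
        new Regex(@"&(?!(?:#[0-9]+|#x[0-9A-Fa-f]+|lt|gt|amp|quot|apos);)");

    // A < that is not followed by one of the valid tag names.
    private static readonly Regex StrayLt =
        new Regex(@"\<(?! ?/?(?:" + ValidTags + "))");

    // A > that is not preceded by one of the valid tag names.
    private static readonly Regex StrayGt =
        new Regex(@"(?<! ?/?(?:" + ValidTags + @"))\>");

    public static string Escape(string text)
    {
        // Ampersands go first so the entities written below are not re-escaped.
        text = BareAmp.Replace(text, "&amp;");
        text = StrayLt.Replace(text, "&lt;");
        return StrayGt.Replace(text, "&gt;");
    }
}
```

On the question's example this returns Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" />. Two limitations are worth keeping in mind. The lookahead only checks that a tag name starts with an allowed name, so <blah> passes as if it were <b>. And a > that closes a self-closing tag written with a space, such as <u />, is not directly preceded by a tag name, so it gets escaped.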