Best way to encode text data for XML

I was looking for a generic method in .Net to encode a string for use in an Xml element or attribute, and was surprised when I didn't immediately find one. So, before I go too much further, could I just be missing the built-in function?

Assuming for a moment that it really doesn't exist, I'm putting together my own generic EncodeForXml(string data) method, and I'm thinking about the best way to do this.

The data that prompted this whole thing could contain bad characters like &, <, ", etc. It could also occasionally contain properly escaped entities (&amp;, &lt;, and &quot;), which means just using a CDATA section may not be the best idea. That seems kind of clunky anyway; I'd much rather end up with a nice string value that can be used directly in the XML.

I've used a regular expression in the past to just catch bad ampersands, and I'm thinking of using it to catch them in this case as well as the first step, and then doing a simple replace for other characters.

So, could this be optimized further without making it too complex, and is there anything I'm missing?

Imports System.Text.RegularExpressions

Function EncodeForXml(ByVal data As String) As String
    ' Matches any ampersand that does not already begin an entity such as &amp; or &#160;
    Static badAmpersand As New Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)")

    data = badAmpersand.Replace(data, "&amp;")

    Return data.Replace("<", "&lt;").Replace("""", "&quot;").Replace(">", "&gt;")
End Function

Sorry to all you C#-only folks. I don't really care which language I use, but I wanted to make the Regex static, and you can't do that in C# without declaring it outside the method, so this will be VB.Net.

Finally, we're still on .Net 2.0 where I work, but if someone could take the final product and turn it into an extension method for the string class, that'd be pretty cool too.

Update: The first few responses indicate that .Net does indeed have built-in ways of doing this. But now that I've started, I kind of want to finish my EncodeForXml() method just for the fun of it, so I'm still looking for ideas for improvement. Notably: a more complete list of characters that should be encoded as entities (perhaps stored in a list/map), and something that performs better than chaining .Replace() calls on immutable strings.
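For what it's worth, the map-plus-single-pass idea could look something like the sketch below. It is in C# rather than VB.Net, and it folds the bad-ampersand pattern and the other characters into one regular expression whose matches are looked up in a dictionary. The class name XmlEncodeSketch and the field names are made up for illustration, the entity list is just the five predefined XML entities, and the lambda assumes a C# 3 or later compiler. It has not been measured against the chained .Replace() version.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class XmlEncodeSketch
{
    // Hypothetical lookup table: raw character -> predefined XML entity.
    static readonly Dictionary<string, string> Entities =
        new Dictionary<string, string>
        {
            { "&", "&amp;" }, { "<", "&lt;" }, { ">", "&gt;" },
            { "\"", "&quot;" }, { "'", "&apos;" },
        };

    // Same "bad ampersand" pattern as the VB.Net version, extended with an
    // alternative that matches the other four characters.
    static readonly Regex NeedsEscaping = new Regex(
        "&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)|[<>\"']",
        RegexOptions.Compiled);

    public static string EncodeForXml(string data)
    {
        // One scan over the input; each match is replaced via the table.
        return NeedsEscaping.Replace(data, m => Entities[m.Value]);
    }
}
```

With this sketch, an input such as a & b &amp; <c> should leave the existing &amp; alone and escape only the bare ampersand and the angle brackets.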

Joel Coehoorn asked Oct 01 '08 13:10


2 Answers

Depending on how much you know about the input, you may have to take into account that not all Unicode characters are valid XML characters.

Both Server.HtmlEncode and System.Security.SecurityElement.Escape seem to ignore illegal XML characters, while System.Xml.XmlWriter.WriteString throws an ArgumentException when it encounters illegal characters (unless you disable that check, in which case it ignores them). An overview of library functions is available here.
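A small sketch of that difference, assuming an XmlWriter created through XmlWriter.Create with default settings (so character checking is on). The class and method names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.IO;
using System.Security;
using System.Xml;

public static class IllegalCharDemo
{
    // Hypothetical helper: true if XmlWriter.WriteString accepts the text,
    // false if it rejects it with an ArgumentException.
    public static bool XmlWriterAccepts(string text)
    {
        var sw = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        try
        {
            using (var writer = XmlWriter.Create(sw, settings))
            {
                writer.WriteStartElement("t");
                writer.WriteString(text);
                writer.WriteEndElement();
            }
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    // Hypothetical helper: true if SecurityElement.Escape returned the text
    // unchanged, i.e. it neither escaped nor removed anything.
    public static bool EscapeLeavesUntouched(string text)
    {
        return SecurityElement.Escape(text) == text;
    }
}
```

For a string containing U+0001, such as "a\u0001b", EscapeLeavesUntouched should return true (the control character passes straight through), while XmlWriterAccepts should return false.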

Edit 2011/8/14: seeing that at least a few people have consulted this answer in the last couple years, I decided to completely rewrite the original code, which had numerous issues, including horribly mishandling UTF-16.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

/// <summary>
/// Encodes data so that it can be safely embedded as text in XML documents.
/// </summary>
public class XmlTextEncoder : TextReader {
    public static string Encode(string s) {
        using (var stream = new StringReader(s))
        using (var encoder = new XmlTextEncoder(stream)) {
            return encoder.ReadToEnd();
        }
    }

    /// <param name="source">The data to be encoded in UTF-16 format.</param>
    /// <param name="filterIllegalChars">It is illegal to encode certain
    /// characters in XML. If true, silently omit these characters from the
    /// output; if false, throw an error when encountered.</param>
    public XmlTextEncoder(TextReader source, bool filterIllegalChars=true) {
        _source = source;
        _filterIllegalChars = filterIllegalChars;
    }

    readonly Queue<char> _buf = new Queue<char>();
    readonly bool _filterIllegalChars;
    readonly TextReader _source;

    public override int Peek() {
        PopulateBuffer();
        if (_buf.Count == 0) return -1;
        return _buf.Peek();
    }

    public override int Read() {
        PopulateBuffer();
        if (_buf.Count == 0) return -1;
        return _buf.Dequeue();
    }

    void PopulateBuffer() {
        const int endSentinel = -1;
        while (_buf.Count == 0 && _source.Peek() != endSentinel) {
            // Strings in .NET are assumed to be UTF-16 encoded [1].
            var c = (char) _source.Read();
            if (Entities.ContainsKey(c)) {
                // Encode all entities defined in the XML spec [2].
                foreach (var i in Entities[c]) _buf.Enqueue(i);
            } else if (!(0x0 <= c && c <= 0x8) &&
                       !new[] { 0xB, 0xC }.Contains(c) &&
                       !(0xE <= c && c <= 0x1F) &&
                       !(0x7F <= c && c <= 0x84) &&
                       !(0x86 <= c && c <= 0x9F) &&
                       !(0xD800 <= c && c <= 0xDFFF) &&
                       !new[] { 0xFFFE, 0xFFFF }.Contains(c)) {
                // Allow if the Unicode codepoint is legal in XML [3].
                _buf.Enqueue(c);
            } else if (char.IsHighSurrogate(c) &&
                       _source.Peek() != endSentinel &&
                       char.IsLowSurrogate((char) _source.Peek())) {
                // Allow well-formed surrogate pairs [1].
                _buf.Enqueue(c);
                _buf.Enqueue((char) _source.Read());
            } else if (!_filterIllegalChars) {
                // Note that we cannot encode illegal characters as entity
                // references due to the "Legal Character" constraint of
                // XML [4]. Nor are they allowed in CDATA sections [5].
                throw new ArgumentException(
                    String.Format("Illegal character: '{0:X}'", (int) c));
            }
        }
    }

    static readonly Dictionary<char,string> Entities =
        new Dictionary<char,string> {
            { '"', "&quot;" }, { '&', "&amp;"}, { '\'', "&apos;" },
            { '<', "&lt;" }, { '>', "&gt;" },
        };

    // References:
    // [1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
    // [2] http://www.w3.org/TR/xml11/#sec-predefined-ent
    // [3] http://www.w3.org/TR/xml11/#charsets
    // [4] http://www.w3.org/TR/xml11/#sec-references
    // [5] http://www.w3.org/TR/xml11/#sec-cdata-sect
}

Unit tests and full code can be found here.

Michael Kropat answered Sep 28 '22 05:09


SecurityElement.Escape

documented here
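A minimal usage sketch, in case it helps. SecurityElement.Escape is the real framework method; the wrapper class and the ForAttribute helper are invented here purely to show a call site.

```csharp
using System;
using System.Security;

public static class EscapeDemo
{
    // Hypothetical helper: builds name="value" with the value escaped
    // by SecurityElement.Escape.
    public static string ForAttribute(string name, string value)
    {
        return name + "=\"" + SecurityElement.Escape(value) + "\"";
    }
}
```

Calling EscapeDemo.ForAttribute("title", "Tom & \"Jerry\"") should give title="Tom &amp; &quot;Jerry&quot;". Note that Escape rewrites every ampersand, so input that already contains &amp; comes out as &amp;amp;, which matters for the mixed data described in the question.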

workmad3 answered Sep 28 '22 05:09