
C# XmlWriter and invalid UTF8 characters

Tags:

c#

.net

xml

utf-8

We created a unit test that uses the following methods to generate random UTF8 text:

        private static Random _rand = new Random(Environment.TickCount);

        public static byte CreateByte()
        {
            return (byte)_rand.Next(byte.MinValue, byte.MaxValue + 1);
        }

        public static byte[] CreateByteArray(int length)
        {
            return Repeat(CreateByte, length).ToArray();
        }

        public static string CreateUtf8String(int length)
        {
            return Encoding.UTF8.GetString(CreateByteArray(length));
        }

        private static IEnumerable<T> Repeat<T>(Func<T> func, int count)
        {
            for (int i = 0; i < count; i++)
            {
                yield return func();
            }
        }

When we send the random UTF-8 strings to our business logic, XmlWriter writes the generated string and can fail with this error:

Test method UnitTest.Utf8 threw exception: 
System.ArgumentException: ' ', hexadecimal value 0x0E, is an invalid character.

System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
System.Xml.XmlUtf8RawTextWriter.WriteAttributeTextBlock(Char* pSrc, Char* pSrcEnd)
System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
System.Xml.XmlWellFormedWriter.WriteString(String text)
System.Xml.XmlWriter.WriteAttributeString(String localName, String value)

We want to support any possible string to be passed in, and need these invalid characters escaped somehow.

XmlWriter already escapes characters such as &, <, and >. How can we deal with other invalid characters, such as control characters?

PS - let me know if our UTF8 generator is flawed (I'm already seeing where I shouldn't let it generate '\0')

jonathanpeppers asked Dec 08 '10 22:12



4 Answers

The XmlConvert class has many useful methods (such as EncodeName and IsXmlChar) for making sure you're building valid XML.
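As a rough illustration of how IsXmlChar could be applied to the question, here is a minimal sketch that strips characters XML 1.0 does not allow before the string reaches XmlWriter. The class and method names (XmlTextSanitizer, RemoveInvalidXmlChars) are hypothetical, and it assumes .NET 4.0 or later, where XmlConvert.IsXmlChar is available. Dropping characters is lossy, so this only fits cases where the exact input does not need to survive the round trip.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlTextSanitizer
{
    // Hypothetical helper: removes every UTF-16 code unit that XML 1.0 does not allow.
    public static string RemoveInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            else if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                // A well-formed surrogate pair encodes a code point above U+FFFF,
                // which XML allows, so keep both halves together.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            // Everything else (control characters, lone surrogates, U+FFFE, U+FFFF) is skipped.
        }
        return sb.ToString();
    }
}
```

For example, `XmlTextSanitizer.RemoveInvalidXmlChars("a\u000Eb")` returns `"ab"`, while TAB, LF, and CR pass through unchanged.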

Simon Mourier answered Sep 23 '22 06:09


Your UTF-8 generator appears to be flawed. There are many byte sequences which are invalid UTF-8 encodings.

A better way to generate valid random UTF-8 encodings is to generate random characters, put them into a string and then encode the string to UTF-8.
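One way this could look is sketched below: random code points are picked first, assembled into a string, and only then encoded to UTF-8. The names RandomText, CreateXmlSafeString, and CreateUtf8Bytes are hypothetical. As an extra assumption beyond this answer, the sketch also skips code points that XML 1.0 forbids, and for simplicity it never produces TAB, LF, or CR even though XML allows them.

```csharp
using System;
using System.Text;

public static class RandomText
{
    private static readonly Random _rand = new Random();

    // Hypothetical replacement for CreateUtf8String: builds a string from random
    // code points instead of decoding random bytes. length counts code points,
    // so the resulting string may contain more than length UTF-16 chars.
    public static string CreateXmlSafeString(int length)
    {
        var sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            int codePoint;
            do
            {
                // Upper bound is exclusive, so this yields U+0020 through U+10FFFF.
                codePoint = _rand.Next(0x20, 0x110000);
            }
            // Reject surrogates (not characters on their own) and U+FFFE/U+FFFF.
            while ((codePoint >= 0xD800 && codePoint <= 0xDFFF)
                   || codePoint == 0xFFFE || codePoint == 0xFFFF);

            sb.Append(char.ConvertFromUtf32(codePoint));
        }
        return sb.ToString();
    }

    // Encoding a well-formed string always yields a valid UTF-8 byte sequence.
    public static byte[] CreateUtf8Bytes(int length)
    {
        return Encoding.UTF8.GetBytes(CreateXmlSafeString(length));
    }
}
```

Because the bytes come from encoding a well-formed string, decoding them again gives back the same string, which is not guaranteed for the random-byte approach in the question.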

Mark Byers answered Sep 23 '22 06:09


There are two problems:

  1. Not all characters are valid for XML, even escaped. For XML 1.0, the only characters with a Unicode codepoint value of less than 0x0020 that are valid are TAB (&#9;), LF (&#10;), and CR (&#13;). See XML 1.0, Section 2.2, Characters.

    For XML 1.1, which relatively few systems support, any character except NUL can be escaped in this manner.

  2. Not all sequences of bytes are valid for UTF-8. For example, according to the specification, "The octet values C0, C1, F5 to FF never appear." Probably you would be better off just creating Strings of characters and ignoring UTF-8, or creating the String, converting it to UTF-8 and back if you're really into encoding.
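The second problem can be checked in code. The sketch below uses a strict UTF8Encoding, which throws on malformed input, whereas Encoding.UTF8.GetString (as used in the question) silently substitutes U+FFFD for invalid bytes. The names Utf8Check and IsValidUtf8 are hypothetical.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Hypothetical helper: true only if the bytes form a well-formed UTF-8 sequence.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // replacing malformed input with U+FFFD.
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

With this helper, `Utf8Check.IsValidUtf8(new byte[] { 0xC0, 0x80 })` is false, while `Utf8Check.IsValidUtf8(new byte[] { 0x0E })` is true. The second case shows that the two problems are independent: 0x0E is perfectly valid UTF-8 but still not a legal XML 1.0 character.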

lavinio answered Sep 21 '22 06:09


Mark points out that not every byte sequence is a valid UTF-8 sequence.

I'd like to add that not every character can exist in an XML document. Only some characters are valid, and this is true even if they are encoded as a numeric character reference.

Update: If you want to encode arbitrary binary data in XML, then use Base64 or some other encoding before writing them to XML.
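A minimal sketch of the Base64 route, assuming the payload is written as element content: XmlWriter.WriteBase64 does the encoding, so no byte value can produce an invalid XML character. The class name Base64Xml, the method WriteBytes, and the element name data are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

public static class Base64Xml
{
    // Hypothetical helper: writes arbitrary bytes as Base64 text inside a <data> element.
    public static string WriteBytes(byte[] payload)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("data");
            // Base64 uses only letters, digits, '+', '/', and '=', all legal in XML.
            writer.WriteBase64(payload, 0, payload.Length);
            writer.WriteEndElement();
        }
        return output.ToString();
    }
}
```

For instance, the bytes 0x00, 0x0E, 0xFF encode as the Base64 text `AA7/`. A reader can recover the original bytes with Convert.FromBase64String or XmlReader.ReadElementContentAsBase64. For an attribute value, Convert.ToBase64String combined with WriteAttributeString would serve the same purpose.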

Stephen Cleary answered Sep 24 '22 06:09