
Best practice for handling vertical tabs and other invalid xml characters

Tags: text, xml

I have an application which (like many others) takes in user input, stores it in a database and then later processes it using (amongst other things) XML tools. The application takes in free text input and like many other developers I am very careful with escaping and quoting so it can handle input containing different types of whitespace, quote characters, reserved XML characters etc.

However, occasionally a user will manage to enter a string containing a vertical tab character (hex 0B) or a form feed (hex 0C). This cannot be processed by XML tools at all and causes the app to barf.

In my application it's quite important to preserve the original input during the 'round trip' process, so I'm loath to just strip out any characters I don't like, especially things like form feed, which are still occasionally used in plain text files.

Is there any accepted best practice or general strategy for handling these characters when XML processing is involved?

Andy asked Dec 05 '11 14:12




1 Answer

Yes, unfortunately some characters are illegal in XML and have no entity equivalent. For one example of the consequences, see:

http://www.jdom.org/docs/apidocs.1.1/org/jdom/Element.html#setText(java.lang.String)

which is a String setter... that can throw an exception! Vertical tab is exactly one of those characters: there is no XML entity for it, and in XML 1.0 even a numeric character reference such as &#x0B; is rejected, so there is no way to "escape" it with XML alone.
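To make the restriction concrete, here is a minimal Python sketch that reports which characters in a string fall outside the XML 1.0 `Char` production. The function name `find_invalid_xml_chars` is made up for illustration, and the check covers XML 1.0 only.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (index, hex code point) for each character XML 1.0 forbids.

    Hypothetical helper name; not part of any library.
    """
    return [(m.start(), hex(ord(m.group()))) for m in _INVALID_XML10.finditer(text)]


# Vertical tab (0x0B) and form feed (0x0C) are both reported.
print(find_invalid_xml_chars("a\x0bb\x0c"))
```

Running a check like this at the input boundary tells you which strings need special handling before they ever reach an XML tool, without altering the stored original.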

I'm working around this myself by using base64 encoding to sanitize strings that might harbor those characters. It's a bit silly, since I have to base64-encode and decode all the time, but I don't think there's a good alternative.
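A rough sketch of that base64 workaround, written in Python with the standard library's `xml.etree.ElementTree` rather than JDOM. The helper names `to_element` and `from_element` and the `encoding="base64"` attribute are assumptions made up for this example, not an established convention.

```python
import base64
import xml.etree.ElementTree as ET


def to_element(tag, text):
    """Store text as base64 so the XML only ever contains safe ASCII."""
    elem = ET.Element(tag)
    # Hypothetical marker attribute so readers know the payload is encoded.
    elem.set("encoding", "base64")
    elem.text = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return elem


def from_element(elem):
    """Recover the original text, decoding base64 when the marker is present."""
    if elem.get("encoding") == "base64":
        return base64.b64decode(elem.text or "").decode("utf-8")
    return elem.text or ""


original = "page one\x0cpage two\x0bend"
xml_bytes = ET.tostring(to_element("note", original))
restored = from_element(ET.fromstring(xml_bytes))
assert restored == original  # form feed and vertical tab survive the round trip
```

The trade-off is the one described above: the payload is no longer readable or searchable inside the XML, and every consumer has to know to decode it. In exchange the original input is preserved byte for byte.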

dyoo answered Nov 01 '22 00:11