
UTF-8 or ISO-8859-1 in XML

Tags: xml, utf-8

We have an application that takes a text string entered by a user into a web form and packages it in XML. Just to confuse matters a little, the XML is sent as the body of an Outlook email message.

Because the users can paste almost anything into the web form (typically from Word), the text string can contain characters outside 7-bit ASCII, such as the curly opening and closing double quotes.

The string is travelling intact via email but when we use the Microsoft XML parser, it complains (quite rightly) that the XML contains invalid characters.

A quick fix is to put encoding="iso-8859-1" in the XML declaration. However, I wonder whether it would be better to encode the XML in true UTF-8 from the start, as I've read articles stating that the world would be more harmonious if every XML document were encoded in UTF-8.
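To make the mismatch concrete, here is a minimal sketch using Python's standard XML parser rather than the Microsoft one. It assumes the form delivers the pasted text as Windows-1252 bytes (an assumption, not something stated in the question); the `<msg>` element and the `parses` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Curly double quotes as typically pasted from Word.
text = "\u201cHello\u201d"

# Assumption: the form hands the text over as Windows-1252 bytes (0x93 ... 0x94).
cp1252_bytes = text.encode("cp1252")


def parses(document: bytes) -> bool:
    """Return True if the bytes are accepted as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# No declaration: the parser assumes UTF-8, and 0x93 is not valid UTF-8.
no_declaration = b"<msg>" + cp1252_bytes + b"</msg>"

# Declared as iso-8859-1: accepted, but 0x93/0x94 become control characters,
# not the quotes the user typed.
latin1_declared = (
    b'<?xml version="1.0" encoding="iso-8859-1"?><msg>' + cp1252_bytes + b"</msg>"
)

# Real UTF-8 bytes with a matching declaration: the quotes survive.
utf8_declared = (
    b'<?xml version="1.0" encoding="UTF-8"?><msg>' + text.encode("utf-8") + b"</msg>"
)

rejected = not parses(no_declaration)
latin1_text = ET.fromstring(latin1_declared).text
utf8_text = ET.fromstring(utf8_declared).text
```

Under that assumption, declaring iso-8859-1 silences the parser without actually preserving the characters, which is one argument for encoding in UTF-8 at the source.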

But... are we going to have trouble given that the XML document is actually transferred in the body of an email message? I understand that UTF-8 is a variable-length encoding, which I assume uses 7-bit ASCII plus escape characters to indicate "there is more data".

Another option is to set to UTF-8 but replace non-ASCII characters with the &#nnn; format.
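A rough sketch of that last option in Python: escape the XML markup characters first, then let the codec turn everything outside ASCII into numeric character references. The function name `to_ascii_xml_text` and the `<msg>` element are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ascii_xml_text(text: str) -> str:
    """Escape &, <, > and replace non-ASCII characters with &#nnn; references."""
    # escape() handles the markup characters; the "xmlcharrefreplace" error
    # handler rewrites anything that does not fit into ASCII.
    return escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")


original = "\u201cA & B\u201d"
escaped = to_ascii_xml_text(original)

# The resulting document is pure ASCII, so it parses the same way whether the
# declaration says UTF-8 or iso-8859-1.
document = '<?xml version="1.0" encoding="UTF-8"?><msg>' + escaped + "</msg>"
round_tripped = ET.fromstring(document.encode("ascii")).text
```

The output contains only 7-bit characters, so nothing in the transport can damage it, at the cost of a larger and less readable payload.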

Any advice on this rather complicated area would be appreciated.

Cheers, Rob.

Rob Nicholson asked Aug 11 '09 09:08



2 Answers

Here, from outside English-only land{1}, I can confirm that UTF-8 works fine everywhere and has done so for many, many years. I have trouble remembering when MTAs last crippled emails by stripping off the 8th bit, which led to "inventions" like QP (quoted-printable) that basically treated the symptom rather than solving the problem. That most certainly happened during the mid-90s, but UTF-8 quickly gained popularity and replaced iso-8859-1. I do not remember when I switched, but I guess it was before the year 2000 at the latest.
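If you would rather not depend on every hop being 8-bit clean, the MIME layer can wrap the UTF-8 bytes in a 7-bit-safe transfer encoding. The sketch below uses Python's standard `email` package as a stand-in for Outlook (an assumption made purely for illustration); the addresses, subject and `<msg>` element are placeholders.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# A UTF-8 XML document containing curly quotes (placeholder content).
xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n<msg>\u201cHello\u201d</msg>\n'

msg = EmailMessage()
msg["Subject"] = "XML payload"
msg["From"] = "sender@example.com"
msg["To"] = "receiver@example.com"

# Declare the charset and ask for base64 so only ASCII bytes go on the wire.
msg.set_content(xml_text, subtype="xml", charset="utf-8", cte="base64")
wire_bytes = msg.as_bytes()

# Receiving side: parse the raw message and decode the body back to text.
received = message_from_bytes(wire_bytes, policy=policy.default)
received_xml = received.get_content()
```

Quoted-printable (`cte="quoted-printable"`) would work as well and keeps mostly-ASCII XML readable in the raw message; base64 is used here only because it is the simplest to reason about.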

Speaking of iso-8859-1, it will not be able to cover all possible input from your users. Depending on the language, other iso-8859 variants might be needed (for instance for Finnish and Welsh), and even then the 8859 family does not support languages like Chinese. UTF-8, on the other hand, should cover everything, so I strongly recommend it over iso-8859-1.
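A quick way to check the coverage claim for yourself, sketched in Python. The sample strings are my own picks (a Welsh word, two Chinese characters and the curly quotes from the question), and `encodable` is just a helper name for this example.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


samples = {
    "French": "caf\u00e9",               # café
    "Welsh": "\u0175y",                  # ŵy
    "Chinese": "\u4e2d\u6587",           # 中文
    "Word quotes": "\u201chi\u201d",     # curly double quotes
}

# Which samples fit into iso-8859-1, and which fit into UTF-8?
latin1_ok = {name: encodable(text, "iso-8859-1") for name, text in samples.items()}
utf8_ok = {name: encodable(text, "utf-8") for name, text in samples.items()}

# The Welsh sample needs a different member of the 8859 family.
welsh_in_latin8 = encodable(samples["Welsh"], "iso-8859-14")
```

Only the French sample fits into iso-8859-1, while UTF-8 takes all four, so the curly quotes from the question are already outside what iso-8859-1 can represent.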

{1} This might bias my experience since any program not fully supporting UTF-8 would be considered crap and tend not to be used here.

hlovdal answered Nov 10 '22 15:11



I would use UTF-8 whenever possible. It covers far more ground and is more flexible than ISO-8859-1, which already chokes on Eastern European characters (try to write Jiři or something like that in ISO-8859-1 and it will fail miserably).

So if you really want to attempt the change (which I applaud!), then I'd go with UTF-8 and only fall back to ISO-8859-1 if you really can't make UTF-8 work.

Marc

marc_s answered Nov 10 '22 16:11
