Encoding of ASCII string in UTF8 XML document in Byte array

I have the following requirements:

...The document must be encoded in UTF-8... The Lastname field only allows (Extended) ASCII ... City only allows ISOLatin1 ...The message must be put on the (IBM Websphere) MessageQueue as an IBytesMessage

The XML document, for simplicity's sake, looks like this:

<?xml version="1.0" encoding="utf-8"?>
<foo>
  <lastname>John ÐØë</lastname>
  <city>John ÐØë</city>
  <other>UTF-8 string</other>
</foo>

The "ÐØë" characters are (or should be) (Extended) ASCII values 208, 216 and 235, respectively.

I also have an object:

public class foo {
  public string lastname { get; set; }
  public string city { get; set; }
}

So I instantiate an object and set the lastname:

var x = new foo() { lastname = "John ÐØë", city = "John ÐØë" };

Now this is where my headache sets in (or the inception if you will...):

  • Visual Studio / the source code is in Unicode
  • Hence: the object has a Unicode lastname
  • The XML serializer uses UTF-8 to encode the document
  • Lastname should contain only (Extended) ASCII characters; the characters are valid ASCII chars, but of course in UTF-8 encoded form

I normally don't experience any trouble with my encodings; I am familiar with The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) but this one's got me stumped...

I understand that the UTF-8 document will be perfectly able to "contain" both encodings because the codepoints 'overlap'. But where I get lost is when I need to convert the serialized message to a byte-array. When doing a dump I see C3 XX C3 XX C3 XX (I don't have the actual dump at hand). It's clear (or I've been staring at this for too long) that the lastname / city strings are put in the serialized document in their unicode form; the byte-array suggests so.
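To make the dump concrete, here is a minimal C# sketch that prints the bytes of "ÐØë" under UTF-8 and under ISO-8859-1. The class name EncodingDump and the helper Dump are made up for illustration; only the framework calls are real. It shows why the C3 XX pairs appear: each of these characters takes two bytes in UTF-8 but one byte (208, 216, 235) in ISO-8859-1.

```csharp
using System;
using System.Text;

public static class EncodingDump
{
    // Hex dump of a string under a given encoding, bytes separated by hyphens.
    public static string Dump(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        // ÐØë written as escapes so the encoding of the source file does not matter.
        string sample = "\u00D0\u00D8\u00EB";

        Console.WriteLine(Dump(sample, Encoding.UTF8));                      // C3-90-C3-98-C3-AB
        Console.WriteLine(Dump(sample, Encoding.GetEncoding("ISO-8859-1"))); // D0-D8-EB
    }
}
```

The in-memory string is the same in both calls; only the encoding chosen at the moment of conversion to bytes decides which sequence ends up in the array.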

Now what will I have to do, and where, to ensure the Lastname string goes into the XML document and finally the byte-array as an ASCII string (and the actual 208, 216, 235 byte sequence), and that City makes it in there as ISOLatin1?

I know the requirements are backwards, but I can't change those (3rd party). I always use UTF-8 for our internal projects so I have to support the unicode-utf8=>ASCII/ISOLatin1 conversion (of course, only for chars that are in those sets).

My head hurts...

RobIII asked Apr 01 '26 17:04



1 Answer

Never mind how the XML document is encoded for transmission. The right way to do what you want to do—encode certain non-ASCII characters so they survive the trip unscathed—is to use XML character references to represent the characters that need to be so preserved. For instance, your

ÐØë

is represented using XML character references as

&#x00D0;&#x00D8;&#x00EB;

The receiving [conformant] XML processor will/should/must convert those numeric character references back to the characters they represent. Here's some code that will do the trick:

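using System.Text;

public static class XmlCharacterReferenceExtensions
{
  // Replaces every character outside printable ASCII with a numeric XML character reference.
  public static string ConvertToXmlCharacterReference( this string xml )
  {
    StringBuilder sb  = new StringBuilder( xml.Length ) ;
    const char    SP  = '\u0020' ; // anything lower than SP is a control character
    const char    DEL = '\u007F' ; // DEL is a control character too; anything above it isn't ASCII, per se.

    foreach( char ch in xml )
    {
      // Tabs and line breaks pass through: a reference between tags outside the root element is not well-formed.
      bool isWhitespace     = ch == '\t' || ch == '\n' || ch == '\r' ;
      bool isPrintableAscii = ch >= SP && ch < DEL ;

      if ( isPrintableAscii || isWhitespace ) { sb.Append(ch)                              ; }
      else                                    { sb.AppendFormat( "&#x{0:X4};" , (int) ch ) ; } // note the closing ';'
    }

    return sb.ToString() ;
  }
}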
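As a usage sketch, the following self-contained C# program applies the same idea to a one-line document and checks that the resulting UTF-8 byte array contains only ASCII bytes. The class name CharacterReferenceDemo and the method ToCharacterReferences are hypothetical names chosen for this example, and the helper is a compact copy so that the block runs on its own.

```csharp
using System;
using System.Linq;
using System.Text;

public static class CharacterReferenceDemo
{
    // Printable ASCII, tabs and line breaks pass through unchanged;
    // everything else becomes a numeric character reference such as &#x00D0;
    public static string ToCharacterReferences(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char ch in text)
        {
            bool keep = (ch >= '\u0020' && ch < '\u007F') || ch == '\t' || ch == '\n' || ch == '\r';
            if (keep) sb.Append(ch);
            else sb.AppendFormat("&#x{0:X4};", (int)ch);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string xml = "<lastname>John \u00D0\u00D8\u00EB</lastname>";
        string escaped = ToCharacterReferences(xml);
        Console.WriteLine(escaped); // <lastname>John &#x00D0;&#x00D8;&#x00EB;</lastname>

        // Only ASCII characters remain, so UTF-8 encodes each one as a single byte below 0x80.
        byte[] payload = Encoding.UTF8.GetBytes(escaped);
        Console.WriteLine(payload.All(b => b < 0x80)); // True
    }
}
```

The conversion is applied to the serialized document text, as in the method above. Applying it to a property value before serialization would likely backfire, because the serializer escapes the ampersand of each reference to &amp;amp; and the receiver would then see the literal text of the reference instead of the character.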

You could also use a regular expression to make the replacement or write an XSLT that would do the same thing. But the task is so trivial, it doesn't really warrant that sort of approach. The above code is probably faster and less memory intensive and...it's easier to understand.

You should note though that since you want to preserve two different encodings in the same document, your conversion routine will need to differentiate between the conversion from "extended ASCII" to an XML character reference and the conversion from "ISO Latin 1" to an XML character reference.

In both cases, the character reference specifies a code point in the ISO/IEC 10646 character set, which is essentially Unicode. You'll want to map the characters to the appropriate code point. Since strings in the CLR world are UTF-16 encoded, that shouldn't be much of an issue. The above code should work fine, I believe, unless you've got something really oddball that doesn't play nicely with UTF-16, such as a character outside the Basic Multilingual Plane, which a CLR string stores as a surrogate pair of two char values.
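For the oddball case, a character reference has to name the full code point rather than the individual UTF-16 code units. The sketch below is one possible way to handle that, assuming well-formed input; the class name SurrogateAwareReferences is hypothetical, and whitespace handling is left out to keep the focus on surrogate pairs.

```csharp
using System;
using System.Text;

public static class SurrogateAwareReferences
{
    public static string ToCharacterReferences(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            if (ch >= '\u0020' && ch < '\u007F')
            {
                sb.Append(ch); // printable ASCII is copied as-is
                continue;
            }

            int codePoint = ch;
            if (char.IsHighSurrogate(ch) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // Combine the two UTF-16 code units into one Unicode code point.
                codePoint = char.ConvertToUtf32(ch, text[i + 1]);
                i++; // the low surrogate has been consumed as part of the pair
            }
            // Assumption: unpaired surrogates do not occur; real code would reject them here.
            sb.AppendFormat("&#x{0:X4};", codePoint);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToCharacterReferences("\u00EB"));       // &#x00EB;
        Console.WriteLine(ToCharacterReferences("\uD83D\uDE00")); // &#x1F600;
    }
}
```

For the Lastname and City fields in the question this branch should never trigger, since every ISO Latin 1 character fits in a single char.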

Nicholas Carey answered Apr 03 '26 07:04



