
Unicode to Windows-1251 Conversion with XML(HTML)-escaping

I have an XML file and need to produce an HTML file in Windows-1251 encoding by applying an XSL transformation. The problem is that Unicode characters in the XSL file are not converted to HTML numeric character references such as "&#1171;" during the transformation; a "?" is written in their place. How can I make the XslCompiledTransform.Transform method do this conversion? Or is there a method that writes an HTML string to a Windows-1251 HTML file while escaping Unicode characters as character references, so that I can run the transformation to a string and then write that string to the file (something like Convert("ғ") returning "&#1171;")?

XmlReader xmlReader = XmlReader.Create(new StringReader("<Data><Name>The Wizard of Wishaw</Name></Data>"));

XslCompiledTransform xslTrans = new XslCompiledTransform();
xslTrans.Load("sheet.xsl");

using (XmlTextWriter xmlWriter = new XmlTextWriter("result.html", Encoding.GetEncoding("Windows-1251")))
{
    // Writes a Windows-1251 HTML file, but unencodable characters come out as "?" instead of being escaped.
    xslTrans.Transform(xmlReader, xmlWriter);
}

Thanks all for help!

UPDATE

My output configuration tag in XSL-file:

<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" />

I no longer expect XSL to meet my needs. But I am surprised that there is no method to check whether a character can be represented in a given encoding. Something like

Char.IsEncodable('ғ', Encoding.GetEncoding("Windows-1251"))

My current solution is to convert every character greater than 127 (c > 127) to an &#dddd; escape string, but my boss is not satisfied with it, because the source of the generated HTML file is not readable.

meir asked May 10 '11 08:05


1 Answer

Note that XML is both a data model and a serialization format. The data can use a different character set than the serialization of that data.

It looks like the root of your problem is that your serialization process is trying to limit the character set of the data model, whereas you would like to set the character set of the serialization format. Take an example: <band>Motörhead</band> and <band>Mot&#246;rhead</band> are equal XML documents. They have the same structure and exactly the same data. Because of the heavy metal umlaut, the character set of the data is Unicode (or at least something bigger than ASCII), but because of the character reference &#246;, the character set of the latter serialized form of the document is ASCII. Your XML tools still need to be Unicode-aware in both cases to process this data, but with the latter serialization the I/O and file transfer tools do not.
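The equivalence of the two serializations can be checked with a minimal sketch. It assumes LINQ to XML (System.Xml.Linq) is available; the class name is only illustrative.

```csharp
using System;
using System.Xml.Linq;

static class SerializationDemo
{
    // Parses two serializations and reports whether they carry the same data.
    public static bool SameData(string xmlA, string xmlB)
    {
        XElement a = XElement.Parse(xmlA);
        XElement b = XElement.Parse(xmlB);
        return XNode.DeepEquals(a, b);
    }

    static void Main()
    {
        // Literal character vs. numeric character reference (U+00F6 = 246).
        Console.WriteLine(SameData("<band>Motörhead</band>",
                                   "<band>Mot&#246;rhead</band>"));
    }
}
```

After parsing, both elements hold the same text node, so the comparison should report that they are equal; only the bytes on disk differ.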

My guess is that when you tell the XmlTextWriter to use the Windows-1251 encoding, it in practice limits the character set of the data to the characters contained in Windows-1251: every character outside that set is discarded and a ? character is written instead.
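The replacement behaviour described above can be observed directly on the encoding, and the same mechanism gives a way to test whether a character is representable. The following is a sketch only: IsEncodable is a hypothetical helper, not a framework method, and it relies on swapping in EncoderExceptionFallback so that unencodable input throws instead of being replaced.

```csharp
using System;
using System.Text;

static class EncodingCheck
{
    // Hypothetical helper (not part of .NET): true if `c` can be
    // represented in `encoding` without any substitution.
    public static bool IsEncodable(char c, Encoding encoding)
    {
        // Same code page, but with fallbacks that throw rather than replace.
        Encoding strict = Encoding.GetEncoding(
            encoding.CodePage,
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            strict.GetBytes(new[] { c });
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        // Assumption: on .NET Core / .NET 5+ code-page encodings must be
        // registered first (System.Text.Encoding.CodePages):
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1251 = Encoding.GetEncoding("windows-1251");

        // Default fallback: the unencodable character is silently substituted.
        byte[] lossy = cp1251.GetBytes("ғ");
        Console.WriteLine(cp1251.GetString(lossy));

        Console.WriteLine(IsEncodable('ғ', cp1251)); // U+0493 is not in windows-1251
        Console.WriteLine(IsEncodable('г', cp1251)); // U+0433 is in windows-1251
    }
}
```

With a check like this, only the characters that fail it would need to be written as &#dddd; references, leaving ordinary Cyrillic text readable in the HTML source.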

However, since you produce your XML document with an XSL transformation, you can control the character set of the serialization directly in your XSLT document. This is done by adding an encoding attribute to the xsl:output element. Modify it to look like this:

<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" encoding="windows-1251"/>

Now the XSLT processor takes care of the serialization to the reduced character set and outputs a character reference for every character in the data that is not included in windows-1251.
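One caveat on the .NET side: XslCompiledTransform ignores xsl:output when it is handed a ready-made XmlWriter such as the XmlTextWriter in the question, so the writer has to be created from the stylesheet's own OutputSettings. The sketch below is self-contained and therefore uses an inline stylesheet and a MemoryStream; the stylesheet content and the class name are assumptions for illustration, not the asker's sheet.xsl.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

static class Cp1251Transform
{
    // Illustrative stylesheet; note encoding='windows-1251' on xsl:output.
    const string Sheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' indent='yes' omit-xml-declaration='yes'
              encoding='windows-1251'/>
  <xsl:template match='/'>
    <p><xsl:value-of select='/Data/Name'/> ғ</p>
  </xsl:template>
</xsl:stylesheet>";

    public static byte[] Run(string inputXml)
    {
        // Assumption: on .NET Core / .NET 5+ register code-page encodings first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        var xslTrans = new XslCompiledTransform();
        using (XmlReader sheet = XmlReader.Create(new StringReader(Sheet)))
        {
            xslTrans.Load(sheet);
        }

        using (var output = new MemoryStream())
        {
            using (XmlReader input = XmlReader.Create(new StringReader(inputXml)))
            // OutputSettings carries the xsl:output options, including the encoding.
            using (XmlWriter writer = XmlWriter.Create(output, xslTrans.OutputSettings))
            {
                xslTrans.Transform(input, writer);
            }
            return output.ToArray();
        }
    }

    static void Main()
    {
        byte[] bytes = Run("<Data><Name>The Wizard of Wishaw</Name></Data>");
        // The bytes are windows-1251; escaped characters are plain ASCII.
        Console.WriteLine(Encoding.ASCII.GetString(bytes));
    }
}
```

For the original scenario, replacing the MemoryStream with XmlWriter.Create("result.html", xslTrans.OutputSettings) should write the file directly, with the character outside windows-1251 emitted as a numeric character reference rather than "?".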

If changing the character set of the data is really what you need, then you need to process your data with a suitable character conversion library that can guess the most suitable replacement character (like ö -> o).

jasso answered Oct 07 '22 22:10