
Why does XSLT output encoding="UTF-8" not convert an ISO-8859-1 character?

Why is an iso-8859-1 character not converted to utf-8 in the output file when setting output encoding to utf-8?

I have an xml input file in iso-8859-1 encoding, and the encoding is declared. I want to output it in utf-8. My understanding is that setting the output encoding in the xslt file should manage the character conversion.

Is my understanding wrong? If not, why does the following simple test case output an iso-8859-1 character in a utf-8 declared output file?

My input file looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<data>ö</data>

My transform looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
    <xsl:output encoding="UTF-8" />
    <xsl:template match="/">
        <result>
            <xsl:value-of select="." />
        </result>
    </xsl:template>
</xsl:stylesheet>

Using saxon9he from the command line my result looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<result>ö</result>

The ö in my result file is the single byte 0xF6 according to BabelPad, which is not valid UTF-8 (the UTF-8 encoding of ö is the two bytes 0xC3 0xB6). The ö seems to be untouched by the transformation.
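
As a sanity check on those byte values, here is a minimal sketch comparing how ö is encoded in the two character sets. Python is an assumption here, used only for illustration; it is not part of the Saxon setup, and the variable names are made up for the example.

```python
# "ö" is the code point U+00F6; the two encodings give different bytes.
char = "\u00f6"

latin1_bytes = char.encode("iso-8859-1")  # one byte
utf8_bytes = char.encode("utf-8")         # two bytes

print(latin1_bytes.hex())  # f6
print(utf8_bytes.hex())    # c3b6

# A lone 0xF6 byte cannot be decoded as UTF-8.
try:
    latin1_bytes.decode("utf-8")
    lone_f6_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_f6_is_valid_utf8 = False

print(lone_f6_is_valid_utf8)  # False
```

So a correctly converted output file should contain 0xC3 0xB6 where the input had 0xF6.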

Thanks for any help!

user1981490 asked Feb 08 '13 01:02

1 Answer

I can see two possible explanations (though there are probably others).

(a) the final stage of serialization, that is, converting characters to bytes, is not being done by the XSLT processor but by some other piece of software that does not have access to the stylesheet. This would happen, for example, if you run the transformation in a Java application that sends the output to a Writer rather than an OutputStream - the Writer would convert characters to bytes using the platform default encoding, which is probably iso-8859-1.

(b) the octets you are seeing in your display are not the octets stored on disk, but some transformation of them. This can happen when you load a file into an editor and then ask for a hex display; in some cases you will get a hex display of the editor's in-memory representation of the document, not of what is stored on disk.
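
Both explanations can be illustrated with a rough sketch. It is written in Python rather than Java and does not run Saxon: `result_text` is a hypothetical stand-in for the characters a serializer would produce, and `io.TextIOWrapper` plays the role of a Java Writer whose encoding is chosen by the caller rather than by the stylesheet. The function names are invented for the example.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the serialized result, still as characters.
result_text = '<?xml version="1.0" encoding="UTF-8"?>\n<result>\u00f6</result>\n'


def write_via_character_stream(text, platform_encoding):
    """Case (a): a character stream owned by the caller picks the bytes."""
    buffer = io.BytesIO()
    writer = io.TextIOWrapper(buffer, encoding=platform_encoding, newline="")
    writer.write(text)
    writer.flush()
    data = buffer.getvalue()
    writer.detach()  # leave the underlying buffer alone
    return data


def write_as_declared(text):
    """The serializer owns the byte stream and encodes as declared (UTF-8)."""
    return text.encode("utf-8")


def hex_of_file(path):
    """Case (b): dump the octets actually stored on disk."""
    with open(path, "rb") as f:
        return " ".join(f"{b:02x}" for b in f.read())


wrong_bytes = write_via_character_stream(result_text, "iso-8859-1")
right_bytes = write_as_declared(result_text)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.xml")
    with open(path, "wb") as f:
        f.write(wrong_bytes)
    dump = hex_of_file(path)

print(b"\xf6" in wrong_bytes)      # True: declaration says UTF-8, bytes are Latin-1
print(b"\xc3\xb6" in right_bytes)  # True: the expected UTF-8 sequence
```

In the first case the XML declaration still claims UTF-8 because it was written as characters before the encoding step, which matches the symptom in the question. Reading the file in binary mode, as `hex_of_file` does, rules out the second case by bypassing any editor's in-memory representation.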

Michael Kay answered Oct 17 '22 22:10