
xslt transforms utf-8 characters to a different encoding

Tags:

xslt

This problem occurs intermittently: I have performed many XSLT transformations without it, and then it suddenly appeared during my latest one.

I have a large number of html input files with a structure similar to the following a.html:

<html>
  <body>
    <div class="wrd">
      <div class="wrd-id">5</div>
      <div class="wrd-wrd">address</div>
      <div class="wrd-ipa">əˈdres,ˈaˌdres</div>
    </div>
    <div class="a">...</div>
  </body>
</html>

When I check the encoding of the input files I get the following result:

file -I a.html 
a.html: text/html; charset=utf-8
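
Note that file -I infers the charset heuristically from the bytes it inspects. A stricter check, sketched here on the assumption that iconv is installed, is to let iconv decode the whole file as UTF-8. The helper name is_valid_utf8 is made up for this example and is not part of any tool mentioned above.

```shell
#!/bin/bash
# is_valid_utf8 FILE  (hypothetical helper, not a standard command)
# Succeeds only if iconv can decode every byte of FILE as UTF-8.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example use on the input file from the question:
#   if is_valid_utf8 a.html; then echo "valid UTF-8"; else echo "not UTF-8"; fi
```

If this check passes for a.html, the input bytes are fine and the damage must happen while the file is read or written by the transformation.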

I transform the html files with an xslt similar to the following a.xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
 <xsl:output omit-xml-declaration="yes" indent="yes" encoding="UTF-8" />
 <xsl:strip-space elements="*" />

 <xsl:template match="@*|node()" >
  <xsl:copy>
   <xsl:apply-templates select="@*|node()" />
  </xsl:copy>
 </xsl:template>

 <xsl:template match="div[@class='a']" >
  <xsl:apply-templates select="node()" />
 </xsl:template>

</xsl:stylesheet>

I use a script similar to the following a.sh:

#!/bin/bash
xsltproc --html a.xslt a.html > b.html

A more complete bash script is the following:

#!/bin/bash
xsltproc --html a.xslt a.html \
| hxnormalize -x -l 1024 \
| sed '/^$/d' \
> b.html

And I obtain the following result b.html:

<html>
  <body>
    <div class="wrd">
      <div class="wrd-id">5</div>
      <div class="wrd-wrd">address</div>
      <div class="wrd-ipa">ÉËdres,ËaËdres</div>
    </div>
    ...
  </body>
</html>

In fact, my output also contains a few upside-down question marks that I cannot copy and paste here; the original post showed them in a screenshot captioned "non UTF-8 output".

The non-ASCII characters of the input, which are encoded as UTF-8, have been transformed into something else.

When I check the encoding of the file b.html I get the following result:

file -I b.html
b.html: text/html; charset=utf-8

How can I prevent an XSLT transformation from changing my characters from one encoding to another?

UPDATE 1

Removing the "--html" option from the xsltproc command resolves the problem, although I am still not sure why.

#!/bin/bash
xsltproc a.xslt a.html > b.html

UPDATE 2

It seems that the input file is interpreted as ASCII or ISO-8859-1 instead of UTF-8. I have inserted the following header in the input a.html:

  <head>
    <meta charset="UTF-8">
    <meta http-equiv="content-type" content="text/html">
  </head>

However the output b.html is still the same.

UPDATE 3

I have updated a.xslt to the following:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="html" version="4.0" encoding="UTF-8" indent="yes" />
 <xsl:strip-space elements="*"/>

 <xsl:template match="@* | node()">
  <xsl:copy>
   <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
 </xsl:template>

</xsl:stylesheet>

Please note the different xsl:output line.

This creates b.html with the same problem, but its first line is now the following DOCTYPE declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">

Perhaps this is the reason why ASCII or ISO-8859-1 is used to interpret the input file.

Yalmar asked Mar 27 '16 12:03


1 Answer

SOLUTION

xsltproc picks up the encoding of HTML input files from the META Content-Type header. When that header is missing, or carries no charset as in UPDATE 2, it may assume the wrong encoding and garble the non-ASCII characters while reading the file.
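
The garbling can be reproduced without xsltproc. The sketch below assumes the parser fell back to ISO-8859-1 (the question only suspects ASCII or ISO-8859-1) and that iconv is available. It reads the UTF-8 bytes of the IPA string as ISO-8859-1 and writes them out as UTF-8 again, which is what a reader with the wrong input encoding and a UTF-8 writer would do.

```shell
#!/bin/bash
# The first characters of the wrd-ipa value from a.html, stored as UTF-8.
original='əˈdres'

# Misread the UTF-8 bytes as ISO-8859-1, then encode the result as UTF-8.
# Each byte of a multi-byte character becomes a character of its own,
# so the two bytes of ə (c9 99) turn into four bytes (c3 89 c2 99).
garbled=$(printf '%s' "$original" | iconv -f ISO-8859-1 -t UTF-8)
printf '%s\n' "$garbled"

# The garbled text is still valid UTF-8, so the damage is reversible
# by converting in the opposite direction.
restored=$(printf '%s' "$garbled" | iconv -f UTF-8 -t ISO-8859-1)
printf '%s\n' "$restored"
```

Because the garbled text is itself well-formed UTF-8, this would also be consistent with file -I b.html still reporting charset=utf-8.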

I have inserted the following header in the input a.html:

<head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
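
Since the question involves a large number of input files, the header could be added with a small script. This is only a sketch: add_meta is a hypothetical helper, and it assumes that every file has a bare <html> start tag and no <head> element yet, like a.html in the question.

```shell
#!/bin/bash
# add_meta FILE  (hypothetical helper)
# Inserts a <head> with a Content-Type META right after the <html> start tag,
# unless FILE already declares a charset in such a META element.
add_meta() {
  local file=$1
  if grep -qi 'http-equiv="content-type"[^>]*charset=' "$file"; then
    return 0
  fi
  # Write to a temporary file first; in-place editing with sed -i
  # behaves differently on GNU and BSD systems.
  sed 's|<html>|<html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"></head>|' \
    "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example use over all input files in the current directory:
#   for f in *.html; do [ -e "$f" ] && add_meta "$f"; done
```

Running the helper twice on the same file leaves it unchanged the second time, so it is safe to re-run over a partially processed directory.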

And I have run the following bash script:

#!/bin/bash
xsltproc --html a.xslt a.html > b.html

The xslt a.xslt is the following:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="html" version="4.0" encoding="UTF-8" indent="yes" />
 <xsl:strip-space elements="*"/>

 <xsl:template match="@* | node()">
  <xsl:copy>
   <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
 </xsl:template>

</xsl:stylesheet>

And the output file b.html is at last as expected:

<html>
  <body>
    <div class="wrd">
      <div class="wrd-id">5</div>
      <div class="wrd-wrd">address</div>
      <div class="wrd-ipa">əˈdres,ˈaˌdres</div>
    </div>
    <div class="a">...</div>
  </body>
</html>
Yalmar answered Nov 16 '22 11:11