Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

HXT ignoring HTML DTD, replacing it with XML DTD

I'm having a bit of trouble figuring out why HXT is replacing my DTD's. Firstly, here is my input file to be parsed:

<!DOCTYPE html>
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <h1>foo</h1>
  </body>
</html>

and this is the output that I get:

<?xml version="1.0" encoding="US-ASCII"?>
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <h1>foo</h1>
  </body>
</html>

Finally, here is a simplified version of the arrows I'm using:

start (App src dest) = runX $
                         readDocument [ withValidate no
                                      , withSubstDTDEntities no
                                      , withParseHTML yes
                                      --, withTagSoup
                                      ]
                                      src
                         >>>
                         this
                         >>>
                         writeDocument [ withIndent yes
                                       , withSubstDTDEntities no
                                       , withOutputHTML
                                       --, withOutputEncoding "UTF-8"
                                       ]
                                       dest

I apologize for the comments - I've been toying with different combinations of configs. I just can't seem to get HXT to not mess with DTDs, even with withSubstDTDEntities no, withValidate no, etc. I am getting a warning saying that HXT is ignoring my doctype declaration, but that's the only bit of insight I have. Can anyone please lend me a hand? Thank you in advance!

like image 593
Athan Clark Avatar asked Sep 29 '22 11:09

Athan Clark


1 Answers

You have two problems

HXT only accepts one of the following three html doctypes

<!DOCTYPE html 
 PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
 "DTD/xhtml1-strict.dtd">

<!DOCTYPE html
 PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "DTD/xhtml1-transitional.dtd">

<!DOCTYPE html
 PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
 "DTD/xhtml1-frameset.dtd">

Using one of these will get rid of the warning about ignoring the dtd.

Second, add the following option to writeDocument

withAddDefaultDTD yes
like image 113
jamshidh Avatar answered Oct 04 '22 03:10

jamshidh