 

XML to TeX or how to get a beautiful PDF from XHTML-like source

Superficially, an easy question: how do I get a great-looking PDF from my XML document? Actually, my input is a subset of XHTML with a few custom attributes added (to save some information on citation sources, etc). I've been exploring some routes and would like to get some feedback if anyone has tried some of this before.

Note: I've considered XSL-FO to generate PDFs but heard that the typographic quality of open source tools still lags far behind TeX. I guess the most advanced one is Apache FOP. But I'm really interested in great-looking PDFs (otherwise I could use the print dialog of my browser). Any thoughts or updates on this?

So I've been thinking of using XSLT to convert my customized XML/XHTML dialect to DocBook and go from there (DocBook via XSLT to proper HTML seems to work quite well, so I might use it for that as well). But how do I go from DocBook to TeX? I've come across a number of solutions.

  • dblatex A set of XSLT stylesheets that output LaTeX.
  • db2latex Started as a clone of dblatex but now provides tighter integration with LaTeX packages and provides a single script to output PDF, which is quite nice.
  • PassiveTeX Instead of XSLT it uses an XML parser written in TeX.
  • TeXML An XML serialization of the LaTeX language that can be used as an intermediate format, plus an accompanying Python tool that transforms that XML format to LaTeX/ConTeXt. They claim that this avoids the existing solutions' problems with special symbols, lost braces or spaces, and support for only Latin-1 encoding. (Is this still the case?)

As my input XML might contain quite a few special characters represented in Unicode, the last point is especially important to me. I've also been thinking of using XeTeX instead of pdfTeX to get around this problem. (I might lose some typographic quality though, but maybe it's still better than current open source XSL-FO processors?) So db2latex and TeXML seem to be the favorites. Can anybody comment on the robustness of those?

Alternatively, I might have more luck using ConTeXt directly, as there seems to be quite some interest in the ConTeXt community in XML. Especially, I might take a deeper look at "My Way: Getting Web Content and pdf-Output from One Source" and "Dealing with XML in ConTeXt MkIV". Both documents describe an approach using ConTeXt combined with LuaTeX. (DocBook In ConTeXt seems to do about the same but the latest version is from 2003.) The second document notes:

You may wonder why we do these manipulations in TeX and not use xslt instead. The advantage of an integrated approach is that it simplifies usage. Think of not only processing a document, but also using xml for managing resources in the same run. An xslt approach is just as verbose (after all, you still need to produce TeX code) and probably less readable. In the case of MkIV the integrated approach is also faster and gives us the option to manipulate content at runtime using Lua.

What do you think about this? Please keep in mind that I have some experience with both XSLT and TeX but have never gone terribly deep into either of them. Never tried many different LaTeX packages or alternatives such as ConTeXt (or XeTeX/LuaTeX instead of pdfTeX) but I am willing to learn some new stuff to get my beautiful PDFs in the end ;)

Also, I stumbled over Pandoc but couldn't find any info on how it compares to the other mentioned approaches. And lastly, a link to some quite extensive documentation on how to use TeXML with ConTeXt.

mb21 asked Apr 08 '12 12:04



1 Answer

I've done something like this in the past (that is, maintaining master versions of documents in XML, and wanting to produce LaTeX output from them).

I've used PassiveTeX, but I found creating stylesheets to be hard work -- the usual result of writing two languages at once. I got it to work, and the result looked very good, but it was probably more effort than it was worth. That said, if the amount of styling you need to add is small, then this might be a good route, because it's a single step.

The most successful route (read: most flexible and attractive) was to use XSLT to transform the document into structural LaTeX, which matches the intended structure of the result document, but which doesn't attempt to do more than minimal formatting. Depending on your document, that might be normal-looking LaTeX, or it might have bespoke structures. Then write or adapt a LaTeX stylesheet or class file which formats that output into something attractive. That way, you're using XSLT to its strengths (and not going beyond them, which rapidly becomes very frustrating), using LaTeX to its strengths, and not confusing yourself.
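To make "structural LaTeX" a bit more concrete, here is a rough sketch of the idea. It uses Python's standard library rather than XSLT, purely so that it runs without extra tooling; the same mapping could be written as XSLT templates. The `data-source` attribute and the `\sourced` macro are made-up stand-ins for whatever custom attributes and class-file macros you actually have, and escaping of special characters is left out.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from XHTML-like elements to structural LaTeX commands.
# How these commands look on the page is left to the class file.
HEADINGS = {"h1": "section", "h2": "subsection"}
INLINE = {"em": "emph", "strong": "textbf"}


def to_latex(node):
    """Recursively turn an element into structural LaTeX (no escaping here)."""
    inner = (node.text or "") + "".join(
        to_latex(child) + (child.tail or "") for child in node
    )
    if node.tag in HEADINGS:
        return "\\%s{%s}\n\n" % (HEADINGS[node.tag], inner)
    if node.tag in INLINE:
        return "\\%s{%s}" % (INLINE[node.tag], inner)
    if node.tag == "p":
        return inner.strip() + "\n\n"
    if node.tag == "span" and node.get("data-source"):
        # Hypothetical custom attribute mapped to a hypothetical macro.
        return "\\sourced{%s}{%s}" % (node.get("data-source"), inner)
    return inner  # unknown elements: keep the text, drop the markup


doc = ET.fromstring(
    "<body><h1>Intro</h1>"
    '<p>Some <em>fine</em> text <span data-source="knuth84">quoted</span>.</p>'
    "</body>"
)
latex = to_latex(doc)
print(latex)
```

The output contains only `\section`, `\emph` and the bespoke `\sourced` macro, with no fonts, spacing or layout decisions; those would live in the class file that defines `\sourced`.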

That is, this more-or-less matches the approach of your first two alternatives, and whether you go with them, or write/customise a LaTeX stylesheet with bespoke output, is a function of how comfortable you feel with LaTeX stylesheets, and how much complicated or specialised formatting you need to do.

Since you say you need to handle Unicode characters in the input, then yes, XeLaTeX would be a good choice for the LaTeX part of the pipeline.
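One caveat: XeLaTeX reads UTF-8 input directly, so non-ASCII characters can pass through unchanged, but the handful of ASCII characters that LaTeX treats specially still have to be escaped somewhere in the conversion step. A minimal sketch of that, in Python for illustration (the function name `escape_latex` is mine, and in an XSLT pipeline this would be a template or extension function instead):

```python
# LaTeX's special ASCII characters and a text-mode replacement for each.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "$": r"\$",
    "&": r"\&",
    "#": r"\#",
    "%": r"\%",
    "_": r"\_",
    "^": r"\textasciicircum{}",
    "~": r"\textasciitilde{}",
}


def escape_latex(text):
    """Escape LaTeX specials one character at a time; leave Unicode alone."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


sample = "50% of Ærøskøbing & co_op"
escaped = escape_latex(sample)
print(escaped)
```

Replacing character by character, rather than chaining string replacements, avoids escaping the braces and backslashes that an earlier replacement has just introduced.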

Norman Gray answered Sep 29 '22 11:09
