
PDF report with embedded HTML

We have a Java-based system that reads data from a database, merges individual data fields with preset XSL-FO tags and converts the result to PDF with Apache FOP.

In XSL-FO format it looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE Html [
<!ENTITY nbsp  "&#160;"> 
    <!-- all other entities -->
]>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <xsl:output method="xml" indent="yes" />
    <xsl:template match="/">

        <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:svg="http://www.w3.org/2000/svg" font-family="..." font-size="...">
            <fo:layout-master-set>          
                <fo:simple-page-master master-name="Letter Page" page-width="8.500in" page-height="11.000in">

                    <!-- appropriate settings -->

                </fo:simple-page-master>
            </fo:layout-master-set>
            <fo:page-sequence master-reference="Letter Page">

                <!-- some static content -->

                <fo:flow flow-name="xsl-region-body">
                    <fo:block>
                        <fo:table ...>
                            <fo:table-column ... />
                            <fo:table-body>
                                <fo:table-row>
                                    <fo:table-cell ...>
                                        <fo:block text-align="...">
                                            <fo:inline font-size="..." font-weight="...">
                                                <!-- Header / Title -->
                                            </fo:inline>
                                        </fo:block>
                                    </fo:table-cell>
                                </fo:table-row>
                            </fo:table-body>
                        </fo:table>
                    </fo:block>

                    <fo:block>

                        <fo:table ...>
                            <fo:table-column ... />
                            <fo:table-body> 
                                <fo:table-row>
                                    <fo:table-cell>
                                        <fo:block ...>
                                            <!-- Field A -->                                
                                        </fo:block>
                                    </fo:table-cell>
                                </fo:table-row>
                            </fo:table-body>
                        </fo:table>

                        <!-- Other fields in a very similar fashion as the above "Field A" -->

                    </fo:block>

                </fo:flow>      

            </fo:page-sequence>

        </fo:root>              

    </xsl:template>

</xsl:stylesheet>

Now I am looking for a way to allow some of the fields to contain static HTML-formatted content. This content will be generated by our HTML-enabled editor (something along the lines of CLEditor, CKEditor, etc.) or pasted from outside.

My plan is to follow the recipe from this JavaWorld article:

  • use JTidy to convert HTML-formatted string to proper XHTML
  • further modify xhtml2fo.xsl from Antenna House to remove all document-wide and page-wide transformations
  • apply this modified XSLT to my XHTML string (javax.xml.transform)
  • extract all the nodes under the root with XPath (javax.xml.xpath)
  • feed the result directly into existing XSL-FO document

I have a bare-bones version of such code, and it fails with the following error:

(Location of error unknown)org.apache.fop1.fo.ValidationException: "{http://www.w3.org/1999/XSL/Format}table-body" is not a valid child of "fo:block"! (No context info available)

My questions:

  1. What would be the way to troubleshoot this issue?
  2. Can <fo:block> serve as a generic container with other objects (including tables) nested inside?
  3. Is this an overall reasonable approach to solving the task?

If someone has already "been there, done that", please share your experience.

PM 77-1 asked Sep 25 '15 19:09




2 Answers

  1. If you use an XSLT debugger, such as the one in oXygen or XML Spy, then you can step through the transformation. With oXygen -- not sure about XML Spy or other editors -- if you click on the markup in the debugger output, oXygen highlights the markup from both the source and the stylesheet that produced that node.

    Once you have the FO, the focheck framework (https://github.com/AntennaHouse/focheck) has the most complete validation of FO currently available.

  2. fo:block can contain tables, etc. In the XSL 1.1 spec, the definition of every FO includes a 'Contents' subsection that lists its allowed content. See, e.g., http://www.w3.org/TR/xsl11/#fo_block. The definitions of the 'parameter entities' in the content models are at http://www.w3.org/TR/xsl11/#d0e6532, but some FOs have additional restrictions in the text of their definitions.

  3. The article that you cite doesn't seem to have the 'extract all the nodes under the root with XPath' step, and I'm not sure why you need it. Other than that, it looks like a reasonable approach for doing the job using Java.
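For reference, the transform-then-extract steps from the question could look roughly like the sketch below. It is only an illustration under several assumptions: the inline stylesheet is a tiny stand-in for the modified xhtml2fo.xsl (it only handles p elements), the input is assumed to be well-formed XHTML already (the JTidy step is skipped), and the class and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class XhtmlFragmentToFo {
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    // Stand-in stylesheet (an assumption, NOT the Antenna House one):
    // wraps the converted body content in a throwaway root element.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='" + FO_NS + "'"
        + " xmlns:h='http://www.w3.org/1999/xhtml'"
        + " exclude-result-prefixes='h'>"
        + "<xsl:template match='/'>"
        + "<wrapper><xsl:apply-templates select='//h:body/*'/></wrapper>"
        + "</xsl:template>"
        + "<xsl:template match='h:p'>"
        + "<fo:block><xsl:apply-templates/></fo:block>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Creates an empty fo:block in a new document, to act as the insertion point. */
    static Element newFoBlock() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();
            Element block = doc.createElementNS(FO_NS, "fo:block");
            doc.appendChild(block);
            return block;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Converts the XHTML, then appends every element under the result root to target. */
    static int insertConverted(String xhtml, Element target) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            DOMResult result = new DOMResult();
            t.transform(new StreamSource(new StringReader(xhtml)), result);

            // "Extract all the nodes under the root": child elements of the wrapper.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/*/*", result.getNode(), XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                // importNode copies the node into the target's own document.
                target.appendChild(target.getOwnerDocument().importNode(nodes.item(i), true));
            }
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XHTML to FO conversion failed", e);
        }
    }

    public static void main(String[] args) {
        Element target = newFoBlock();
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>"
                + "<body><p>One</p><p>Two</p></body></html>";
        System.out.println(insertConverted(xhtml, target) + " nodes inserted");
    }
}
```

If the modified stylesheet emits only the fragment you need, the wrapper element and the XPath step can both be dropped.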


Instead of inserting the FO transformed from your JTidy-ed HTML into the static FO, you could replace your <!-- Field A --> with non-FO markup that provides enough information to make a reference to the field to insert. You can then make an XSLT stylesheet that transforms the template+references document into straight FO by doing an identity transform on the FO parts -- as in the answer from @kevin-brown -- and using the information in the reference markup to construct the URI to use with the document() function (http://www.w3.org/TR/xslt#document) to find the markup to insert.

If the FO for the field content is sitting on the disk, then using document() is straightforward. If it's not, then you'd have to do something like overriding the URIResolver used by the XSLT processor so that, rather than looking on the disk, it does the right thing to retrieve the content. You may even be able to have the JTidying happen as part of the URIResolver retrieving the HTML. You could also do the transformation to FO 'inside' the URIResolver or, also as @kevin-brown suggested, do it as a separate mode. If the transformation is done before or during the URIResolver retrieving the FO, then the 'main' transformation of template+references to FO just needs to extract the right part of the FO sub-document, e.g. document('constructed-URI')/fo:root/fo:page-sequence/*. However, if you're modifying the stylesheet from Antenna House, then you should be able to modify it to not produce an outer fo:root, etc., anyway.

I did something similar years ago by overriding the URI resolver of the libxslt XSLT processor for an XSLT-based server: the context for successive runs of the inner XSLT processor was saved as documents at special URIs and wasn't necessarily written to the file system at all.
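A minimal sketch of the URIResolver idea in Java, under stated assumptions: the ref:field placeholder element, its urn:example:field-ref namespace, the ".fo" naming scheme, and the class name are all hypothetical, and the field content is assumed to be FO already, held in an in-memory map rather than on disk.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FieldResolverDemo {
    // Identity transform plus one rule: replace <ref:field name="X"/> with the
    // children of the root of the document found at "X.fo".
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:ref='urn:example:field-ref'"
        + " exclude-result-prefixes='ref'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='ref:field'>"
        + "<xsl:copy-of select=\"document(concat(@name, '.fo'))/*/*\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Expands field references in the template using FO fragments held in memory. */
    static String expand(String template, Map<String, String> fields) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            // The transformer's resolver is consulted for document() calls.
            t.setURIResolver((href, base) -> {
                // Keep only the last path segment in case the href arrives absolutized.
                String key = href.substring(href.lastIndexOf('/') + 1);
                String fo = fields.get(key);
                // Returning null lets the processor fall back to its default lookup.
                return fo == null ? null : new StreamSource(new StringReader(fo));
            });
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(template)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Template expansion failed", e);
        }
    }

    public static void main(String[] args) {
        String fo = "http://www.w3.org/1999/XSL/Format";
        Map<String, String> fields = new HashMap<>();
        fields.put("A.fo", "<wrapper xmlns:fo='" + fo + "'>"
                + "<fo:inline>Field A text</fo:inline></wrapper>");
        String template = "<fo:block xmlns:fo='" + fo + "'"
                + " xmlns:ref='urn:example:field-ref'><ref:field name='A'/></fo:block>";
        System.out.println(expand(template, fields));
    }
}
```

In a real system the lambda would fetch the field from the database, and could run the JTidy and XHTML-to-FO steps before handing back the Source.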

You could, instead, possibly write an extension function that does the lookup of the references to the fields. The Print and Page Layout Community Group @ W3C, for example, has produced extension functions for multiple XSLT processors that run an FO processor in the middle of the XSLT transformation to get back the XML for an area tree for the formatted result. See http://www.w3.org/community/ppl/wiki/XSLTExtensions

Tony Graham answered Nov 02 '22 10:11


The best way to troubleshoot is to use a validating viewer/editor to examine the XSL FO. Many (such as oXygen) will show you errors in the XSL FO structure as you open the file and will describe the issue, much as the error you reported does.

In your case, you evidently have an fo:table-body as a child of fo:block, which is not allowed. An fo:table-body has only one valid parent, fo:table. Either you are missing the fo:table tag or you have erroneously inserted an fo:block at this position.
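If no validating editor is at hand, this particular problem can also be located with a few lines of Java. The following is only a rough sketch: the class and method names are invented for this example, and it checks just this one parent/child rule, not the full FO content model.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableBodyCheck {
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    /** Describes every fo:table-body whose parent is not fo:table. */
    static List<String> misplacedTableBodies(String foXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed for the *NS lookups below
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(foXml)));

            List<String> problems = new ArrayList<>();
            NodeList bodies = doc.getElementsByTagNameNS(FO_NS, "table-body");
            for (int i = 0; i < bodies.getLength(); i++) {
                Node parent = bodies.item(i).getParentNode();
                boolean parentIsTable = parent instanceof Element
                        && FO_NS.equals(parent.getNamespaceURI())
                        && "table".equals(parent.getLocalName());
                if (!parentIsTable) {
                    problems.add("fo:table-body found under " + parent.getNodeName());
                }
            }
            return problems;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse FO document", e);
        }
    }

    public static void main(String[] args) {
        String bad = "<fo:block xmlns:fo='" + FO_NS + "'><fo:table-body/></fo:block>";
        // prints: fo:table-body found under fo:block
        misplacedTableBodies(bad).forEach(System.out::println);
    }
}
```

Running this against the FO just before it is handed to FOP should show which generated fragment lost its fo:table wrapper.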

I might do things slightly differently. I would put the XHTML content inline into the XSL FO right where you want it. Then I would create an identity transform that copies over all the content that is fo-based, but converts the XHTML parts using XSL. This way, you can actually step through that transform in an XSL editor like oXygen and see where errors occur and exactly why, as with any other debugger.
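A minimal sketch of that identity-transform approach, driven from Java: FO markup is copied through unchanged, while XHTML elements embedded in it are rewritten. The two XHTML rules shown (p and b/strong), the h prefix, and the class name are assumptions chosen for illustration; a real stylesheet would need rules for every element the editor can produce.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class InlineXhtmlToFo {
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'"
        + " xmlns:h='http://www.w3.org/1999/xhtml'"
        + " exclude-result-prefixes='h'>"
        // Identity template: copies everything that no other rule claims.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // XHTML paragraphs become FO blocks.
        + "<xsl:template match='h:p'>"
        + "<fo:block><xsl:apply-templates/></fo:block>"
        + "</xsl:template>"
        // Bold markup becomes an inline with a font-weight property.
        + "<xsl:template match='h:b|h:strong'>"
        + "<fo:inline font-weight='bold'><xsl:apply-templates/></fo:inline>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Returns the input with embedded XHTML converted to FO. */
    static String transform(String foWithXhtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(foWithXhtml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transform failed", e);
        }
    }

    public static void main(String[] args) {
        String input = "<fo:block xmlns:fo='http://www.w3.org/1999/XSL/Format'"
                + " xmlns:h='http://www.w3.org/1999/xhtml'>"
                + "<h:p>Hello <h:b>world</h:b></h:p></fo:block>";
        System.out.println(transform(input));
    }
}
```

Because the specific XHTML rules have a higher default priority than the identity template, no explicit priority attributes are needed.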

Note: You may also wish to look at other XSL stylesheets, especially if your HTML may contain style="" CSS attributes. In that case it is not simple HTML, and you will need a better method for converting HTML with CSS to FO.

http://www.cloudformatter.com/css2pdf is based on this complete transform. That general stylesheet is available here: http://xep.cloudformatter.com/doc/XSL/xeponline-fo-translate-2.xsl

I am the author of that stylesheet. It does much more than you ask, but has a fairly complex parsing recursion for converting CSS styling into XSL FO attributes.

Kevin Brown answered Nov 02 '22 08:11