
Creating PDF from Word (DOC) using Apache POI and iText in JAVA

I am trying to generate a PDF document from a *.doc document. So far, thanks to Stack Overflow, I have managed to generate one, but with some problems.

My sample code below generates the PDF without formatting or images, just the text. The source document includes blank spaces and images, none of which appear in the PDF.

Here is the code:

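        import java.io.FileInputStream;
        import java.io.FileOutputStream;
        import org.apache.poi.hwpf.extractor.WordExtractor;
        // iText 5 package names; iText 2.x uses com.lowagie.text instead
        import com.itextpdf.text.Document;
        import com.itextpdf.text.PageSize;
        import com.itextpdf.text.Paragraph;
        import com.itextpdf.text.pdf.PdfWriter;

        // Inside a method; sourceFile is a java.io.File
        try (FileInputStream in = new FileInputStream(sourceFile.getAbsolutePath());
             FileOutputStream out = new FileOutputStream(outputFile)) {

            // WordExtractor returns only the plain text of the .doc file
            WordExtractor wd = new WordExtractor(in);
            String text = wd.getText();

            Document pdf = new Document(PageSize.A4);
            PdfWriter.getInstance(pdf, out);

            pdf.open();
            pdf.add(new Paragraph(text));
            pdf.close(); // finishes the PDF; without this the file is incomplete
        }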
Ismet asked May 19 '11 14:05


3 Answers

docx4j includes code for creating a PDF from a docx using iText. It can also use POI to convert a doc to a docx.

There was a time when we supported both methods equally (as well as PDF via XHTML), but we decided to focus on XSL-FO.

If it's an option, you'd be much better off using docx4j to convert a docx to PDF via XSL-FO and FOP.

Use it like so:

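        import org.docx4j.fonts.IdentityPlusMapper;
        import org.docx4j.fonts.Mapper;
        import org.docx4j.fonts.PhysicalFont;
        import org.docx4j.fonts.PhysicalFonts;
        import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

        // Load the docx; inputfilepath is the path to the source file
        WordprocessingMLPackage wordMLPackage
            = WordprocessingMLPackage.load(new java.io.File(inputfilepath));

        // Set up font mapper
        Mapper fontMapper = new IdentityPlusMapper();
        wordMLPackage.setFontMapper(fontMapper);

        // Example of mapping missing font Algerian to installed font Comic Sans MS
        PhysicalFont font
            = PhysicalFonts.getPhysicalFonts().get("Comic Sans MS");
        fontMapper.getFontMappings().put("Algerian", font);

        // Convert via XSL-FO; swap in the commented line to use iText instead
        org.docx4j.convert.out.pdf.PdfConversion c
            = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
        //  = new org.docx4j.convert.out.pdf.viaIText.Conversion(wordMLPackage);

        // Write the PDF next to the input file and close the stream afterwards
        try (java.io.OutputStream os = new java.io.FileOutputStream(inputfilepath + ".pdf")) {
            c.output(os);
        }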

Update July 2016

As of docx4j 3.3.0, Plutext's commercial PDF renderer is docx4j's default option for docx to PDF conversion. You can try an online demo at converter-eval.plutext.com

If you want to use the existing docx to XSL-FO to PDF (or other target supported by Apache FOP) approach, then just add the docx4j-export-FO jar to your classpath.

Either way, to convert docx to PDF, you can use the Docx4J facade's toPDF method.

The old docx to PDF via iText code can be found at https://github.com/plutext/docx4j-export-FO/.../docx4j-extras/PdfViaIText/

JasonPlutext answered Nov 07 '22 10:11


WordExtractor just grabs the plain text, nothing else. That's why all you're seeing is the plain text.

What you'll need to do is get each paragraph individually, then grab each run, fetch the formatting, and generate the equivalent in PDF.
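The paragraph-by-paragraph, run-by-run walk can be sketched without any library. The `Run` class below is a hypothetical stand-in for whatever your parser hands you (with POI's HWPF API the equivalent information would come from `CharacterRun`), and the output is simple XHTML-style markup rather than real PDF calls, so treat it as an illustration of the mapping step only:

```java
import java.util.Arrays;
import java.util.List;

class RunsToMarkupSketch {

    /** Hypothetical stand-in for one run: a stretch of text with uniform formatting. */
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;

        Run(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /** Escapes the characters that are special in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits one paragraph, wrapping each run in the tags its formatting calls for. */
    static String paragraphToXhtml(List<Run> runs) {
        StringBuilder sb = new StringBuilder("<p>");
        for (Run r : runs) {
            String t = escape(r.text);
            if (r.italic) {
                t = "<i>" + t + "</i>";
            }
            if (r.bold) {
                t = "<b>" + t + "</b>";
            }
            sb.append(t);
        }
        return sb.append("</p>").toString();
    }

    public static void main(String[] args) {
        List<Run> paragraph = Arrays.asList(
            new Run("Plain, ", false, false),
            new Run("bold", true, false),
            new Run(" & ", false, false),
            new Run("bold italic", true, true));
        System.out.println(paragraphToXhtml(paragraph));
    }
}
```

In a real converter the same loop would build the PDF library's own text objects with a matching font style instead of appending tags.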

One option may be to find some code that turns XHTML into a PDF. Then, use Apache Tika to turn your Word document into XHTML (it uses POI under the hood, and handles all the formatting stuff for you), and go from the XHTML on to PDF.

Otherwise, if you're going to do it yourself, take a look at the code in Apache Tika for parsing Word files. It's a really great example of how to get at the images, the formatting, the styles etc.

Gagravarr answered Nov 07 '22 09:11


I have successfully used Apache FOP to convert a 'WordML' document to PDF. WordML is the Office 2003 way of saving a Word document as XML. XSLT stylesheets can be found on the web to transform this XML to XSL-FO, which in turn can be rendered by FOP into PDF (among other outputs).

It's not so different from the solution plutext offered, except that it doesn't read a .doc document, whereas docx4j apparently does. If your requirements are flexible enough to have WordML style documents as input, this might be worth looking into.
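The XSLT step of that pipeline needs nothing beyond the JDK. Here is a minimal sketch, assuming a toy input and a toy stylesheet that are both invented for illustration (a real WordML-to-FO stylesheet is far larger and handles styles, tables and images):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class WordMlToFoSketch {

    // Toy input standing in for a WordML file saved from Word 2003.
    static final String TOY_WORDML =
        "<w:wordDocument xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'>"
      + "<w:body><w:p><w:r><w:t>Hello PDF</w:t></w:r></w:p></w:body>"
      + "</w:wordDocument>";

    // Toy stylesheet (an assumption, not a published one): one fo:block per w:p.
    static final String TOY_XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'"
      + " exclude-result-prefixes='w'>"
      + "<xsl:template match='/'>"
      + "<fo:root><fo:layout-master-set>"
      + "<fo:simple-page-master master-name='A4' page-height='29.7cm'"
      + " page-width='21cm' margin='2cm'><fo:region-body/></fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference='A4'>"
      + "<fo:flow flow-name='xsl-region-body'>"
      + "<xsl:apply-templates select='//w:p'/>"
      + "</fo:flow></fo:page-sequence></fo:root>"
      + "</xsl:template>"
      + "<xsl:template match='w:p'>"
      + "<fo:block><xsl:value-of select='.'/></fo:block>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the input and returns the XSL-FO as a string. */
    static String toXslFo(String wordMl, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(wordMl)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXslFo(TOY_WORDML, TOY_XSLT));
    }
}
```

The resulting XSL-FO string is what would then be handed to FOP for rendering; that step is left out here.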

Good luck with your project! Wim

Wivani answered Nov 07 '22 11:11