
Convert .docx to HTML using Java

I tried converting .doc to HTML by using WordToHtmlConverter and it worked perfectly.

But when I tried to convert .docx to HTML, I got stuck.

What I tried:

I used the code below to convert .docx to HTML. It is taken from: How to use Tika's XWPFWordExtractorDecorator class?

        InputStream input = TikaInputStream.get(new File("C:\\Users\\Downloads\\filename.docx"));


        Parser parser = new AutoDetectParser();


        StringWriter sw = new StringWriter();
        SAXTransformerFactory factory = (SAXTransformerFactory)
                 SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
        handler.setResult(new StreamResult(sw));


        try {
            Metadata metadata = new Metadata();
            parser.parse(input, handler, metadata, new ParseContext());
            String xml = sw.toString();
            System.out.print("tika : "+xml); 
        } finally {
            input.close();
        }

The output I got is:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body/>
</html>
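
An empty <body/> like this suggests that no content events ever reached the handler, rather than the serializer dropping them. A minimal sketch to check that half in isolation: it feeds a few SAX events by hand into the same TransformerHandler setup, using only JDK classes. The class name HandlerCheck, the render method and the sample text are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class HandlerCheck { // hypothetical name, not part of Tika

    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Serialize one hand-made paragraph through the handler setup from the question.
    static String render(String text) throws Exception {
        StringWriter sw = new StringWriter();
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
        handler.setResult(new StreamResult(sw));

        // Emit <html><body><p>text</p></body></html> as raw SAX events.
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startPrefixMapping("", XHTML);
        handler.startElement(XHTML, "html", "html", noAttrs);
        handler.startElement(XHTML, "body", "body", noAttrs);
        handler.startElement(XHTML, "p", "p", noAttrs);
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endPrefixMapping("");
        handler.endDocument();
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("hello from SAX"));
    }
}
```

If this prints the paragraph, the handler setup is fine and the problem is upstream, in what the parser emits for the .docx file.
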
  • Please explain where I went wrong.
  • Is there a better way to convert .docx to an HTML string?

I appreciate your help. Thanks.

Vignesh Paramasivam asked Jul 09 '14 11:07



2 Answers

This code worked for me to convert .docx to HTML (see also: Link to code):

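        // Convert a .docx file to an HTML string.
        // XWPFDocument comes from Apache POI; XHTMLConverter, XHTMLOptions and
        // FileURIResolver come from the XDocReport XWPF-to-XHTML converter.
        try (InputStream in = new FileInputStream(new File(path))) {
            XWPFDocument document = new XWPFDocument(in);

            // Resolve image URIs relative to the word/media directory
            XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(new File("word/media")));

            OutputStream out = new ByteArrayOutputStream();
            XHTMLConverter.getInstance().convert(document, out, options);

            String html = out.toString();
            System.out.println(html);
        }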
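
For context on what such a converter reads: a .docx file is a ZIP package whose main text lives in word/document.xml. A dependency-free sketch, assuming you only need plain paragraph text wrapped in <p> tags (no styles, tables, lists or images), might look like the following. The class name DocxTextToHtml and its methods are hypothetical, and this is not what XHTMLConverter does internally.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DocxTextToHtml { // hypothetical name

    // WordprocessingML main namespace used in word/document.xml
    private static final String W =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Find word/document.xml inside the ZIP package and convert its paragraphs.
    static String convert(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = zip.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                    return toHtml(buf.toByteArray());
                }
            }
        }
        throw new IllegalArgumentException("word/document.xml not found");
    }

    // Emit one <p> per w:p element, joining the text of its w:t descendants.
    static String toHtml(byte[] documentXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the file may be untrusted.
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = dbf.newDocumentBuilder().parse(new ByteArrayInputStream(documentXml));

        StringBuilder html = new StringBuilder("<html><body>\n");
        NodeList paragraphs = doc.getElementsByTagNameNS(W, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList texts = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W, "t");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                text.append(texts.item(j).getTextContent());
            }
            html.append("<p>").append(escape(text.toString())).append("</p>\n");
        }
        return html.append("</body></html>").toString();
    }

    // Escape the characters that are significant in HTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Calling DocxTextToHtml.convert(new FileInputStream(path)) returns the HTML string. For anything beyond plain text, a real converter library is the better choice.
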
Vignesh Paramasivam answered Nov 07 '22 05:11


You may want to use the Mammoth .docx-to-HTML library. It converts .docx documents to HTML and can run in the browser as well as on the backend.

  • Supported platforms: JavaScript (browser and node.js, available on npm), Python (available on PyPI), WordPress, Java/JVM (available on Maven Central), and .NET (available on NuGet).
  • Link: https://mike.zwobble.org/projects/mammoth/ (Demo and Article)
  • Github: https://github.com/mwilliamson/mammoth.js
Rakshit Singh answered Nov 07 '22 05:11