How to extract plain text from a DOCX file using the new OOXML support in Apache POI 3.5?

On September 28, 2009, the Apache POI project released version 3.5, which officially supports the OOXML formats introduced in Office 2007, such as DOCX and XLSX.

Please provide a code sample for extracting a DOCX file's content in plain text, ignoring any styles or formatting.

I am asking this because I have been unable to find any Apache POI examples covering the new OOXML support.

Robert Campbell asked Sep 29 '09 14:09



2 Answers

This worked for me. Make sure the required jars are on the classpath (poi-ooxml and its dependencies; you may need to upgrade xmlbeans).

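import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Reads a .docx stream and returns its text, with styles and formatting dropped.
public String extractText(InputStream in) throws Exception {
    XWPFDocument doc = new XWPFDocument(in);              // parse the OOXML package
    XWPFWordExtractor ex = new XWPFWordExtractor(doc);    // plain-text view of the document
    return ex.getText();
}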
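If you are curious what "plain text" means at the file level, here is a rough, dependency-free sketch. It does not use POI and is not how POI implements its extractor; it only relies on the fact that a .docx file is a ZIP archive whose main body is stored in word/document.xml, with text held in w:t elements inside w:p paragraphs. The class name DocxPlainText and the sampleDocx helper are made up for this illustration, and the sketch ignores headers, footers, footnotes, tabs, and nested text boxes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class for illustration only; not part of Apache POI.
public class DocxPlainText {
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of every w:p paragraph in the main body, one per line. */
    public static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry into memory so the XML parser never touches the zip stream.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                for (int n; (n = zip.read(chunk)) != -1; ) {
                    buf.write(chunk, 0, n);
                }
                return paragraphsOf(buf.toByteArray());
            }
        }
        throw new IllegalArgumentException("no word/document.xml entry found");
    }

    private static String paragraphsOf(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            // Concatenate the text runs of this paragraph; run properties (bold, etc.) are skipped.
            NodeList texts = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                out.append(texts.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Builds a tiny in-memory ZIP containing only word/document.xml, for the demo. */
    public static byte[] sampleDocx() throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + "<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Hello </w:t></w:r>"
            + "<w:r><w:t>world</w:t></w:r></w:p>"
            + "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
            + "</w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extractText(new ByteArrayInputStream(sampleDocx())));
    }
}
```

For real documents the POI extractor above is still the better choice, since it knows about the parts of the format this sketch skips.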
Tanuj Chatterjee answered Nov 04 '22 04:11


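This is more generic: ExtractorFactory picks an extractor based on the file type, so the same code also handles other formats such as DOC, XLS, and XLSX.

// ExtractorFactory is in the org.apache.poi.extractor package.
POITextExtractor poitex = ExtractorFactory.createExtractor(in);
return poitex.getText();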

Tanuj Chatterjee answered Nov 04 '22 02:11