On September 28, 2009, the Apache POI project released version 3.5, which officially supports the OOXML formats introduced in Office 2007, such as DOCX and XLSX.
Please provide a code sample for extracting a DOCX file's content in plain text, ignoring any styles or formatting.
I am asking this because I have been unable to find any Apache POI examples covering the new OOXML support.
Apache POI provides a Java API for manipulating various file formats based on Microsoft's Office Open XML (OOXML) and OLE2 standards.
POI stands for “Poor Obfuscation Implementation”. Apache POI is a collection of Java libraries from the Apache Software Foundation for reading, writing, and manipulating Microsoft Office files such as Excel spreadsheets, PowerPoint presentations, and Word documents.
16 September 2022 - POI 5.2.3 available. A full list of changes is available in the change log. People interested should also follow the dev list to track progress. See the downloads page for more details. POI requires Java 8 or newer since version 4.0.1.
public class XWPFDocument extends POIXMLDocument implements Document, IBody. High(ish) level class for working with .docx files. This class tries to hide some of the complexity of the underlying file format, but as it's not a mature and stable API yet, certain parts of the XML structure come through.
This worked for me. Make sure you add the required jars (upgrade xmlbeans, etc.)
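import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Returns the document's text with styles and formatting stripped.
public String extractText(InputStream in) throws Exception {
    XWPFDocument doc = new XWPFDocument(in);
    XWPFWordExtractor ex = new XWPFWordExtractor(doc);
    return ex.getText();
}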
This is more generic, since ExtractorFactory picks the extractor that matches the file type:
POITextExtractor poitex = ExtractorFactory.createExtractor(in);
return poitex.getText();
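For intuition about what "plain text, ignoring styles" means at the file level, here is a minimal sketch that does not use POI at all. It relies only on the facts that a DOCX file is a ZIP archive and that the main body is stored in word/document.xml, with visible text in w:t elements inside w:p paragraphs. The class name DocxPlainText is hypothetical, and this is not how POI's extractors are implemented: it skips headers, footers, footnotes, tabs and line breaks, and text boxes may appear twice. For real use, the POI extractors above are the better choice.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Apache POI.
public class DocxPlainText {
    // Namespace of the WordprocessingML "w:" prefix.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Finds word/document.xml inside the DOCX (a ZIP archive) and returns
    // the text of every w:p paragraph, one paragraph per line.
    public static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
                // Reading from the ZIP stream stops at the end of this entry.
                Document xml = factory.newDocumentBuilder().parse(zip);

                StringBuilder out = new StringBuilder();
                NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    Element paragraph = (Element) paragraphs.item(i);
                    // Formatting lives in sibling elements such as w:rPr;
                    // only the w:t text nodes are collected here.
                    NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
                    for (int j = 0; j < texts.getLength(); j++) {
                        out.append(texts.item(j).getTextContent());
                    }
                    out.append('\n');
                }
                return out.toString();
            }
        }
        throw new IllegalArgumentException("no word/document.xml entry found");
    }
}
```

Calling DocxPlainText.extractText with an InputStream opened on a .docx file returns the body text with one line per paragraph.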