How to extract plain text from a DOCX file using the new OOXML support in Apache POI 3.5?

2 Answers

This worked for me. Make sure the required JARs are on the classpath: the XWPF classes live in poi-ooxml, which also needs an up-to-date xmlbeans, among others.

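import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Loads the DOCX from the stream and returns its plain text.
public String extractText(InputStream in) throws Exception {
    XWPFDocument doc = new XWPFDocument(in);
    XWPFWordExtractor ex = new XWPFWordExtractor(doc);
    return ex.getText();
}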
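
For anyone curious what the extractor is reading, here is a rough sketch that pulls paragraph text out of a DOCX with only the JDK, no POI. It is not how POI implements XWPFWordExtractor, just an illustration of the format: a DOCX is a ZIP package whose main body is word/document.xml, with text in w:t elements inside w:p paragraphs. The class name DocxTextSketch and its methods are made up for this example, and the sketch ignores headers, footers, footnotes, tabs and line breaks, so its output can differ from what POI returns.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Apache POI.
final class DocxTextSketch {
    // Main WordprocessingML namespace used by word/document.xml.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every paragraph in the main body, one per line.
    static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // Copy the current entry into memory before parsing it.
                    ByteArrayOutputStream xml = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    for (int n; (n = zip.read(chunk)) != -1; ) {
                        xml.write(chunk, 0, n);
                    }
                    return paragraphsOf(xml.toByteArray());
                }
            }
        }
        throw new IOException("word/document.xml not found; not a DOCX package?");
    }

    private static String paragraphsOf(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot trigger XXE.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        StringBuilder out = new StringBuilder();
        NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            // Concatenate all text runs (w:t) found inside this paragraph.
            NodeList texts = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                out.append(texts.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxTextSketch.java file.docx");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

For real documents, stick with the POI extractor: it already deals with the parts of the package this sketch skips.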

answered Nov 04 '22 04:11

Tanuj Chatterjee

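This is more generic: ExtractorFactory inspects the input and returns an extractor for whatever format it finds, so the same code is not limited to DOCX.

POITextExtractor poitex = ExtractorFactory.createExtractor(in);
return poitex.getText();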

answered Nov 04 '22 02:11

Tanuj Chatterjee

Tags: xlsx, apache-poi, docx, openxml

Robert Campbell
