
Java: Apache POI: Can I get clean text from MS Word (.doc) files?

The strings I get programmatically from MS Word files using Apache POI do not match the text I see when I open the same files in MS Word.

When using the following code:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());

the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.

The following code "fixes" the single-line problem, but keeps all the invalid characters and unwanted text:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
  System.out.println(paragraph);
}

I don't know if I'm using the wrong method for extracting the text, but this is what I came up with after reading POI's quick guide. If I am, what is the correct approach?

If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?
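For illustration, a hand-written filter could look roughly like the sketch below. It rests on an assumption about the input: that the 'boxes' are the control characters the binary .doc format places around fields (0x13 begins a field, 0x14 separates the field instruction from its visible result, 0x15 ends it), so that "HYPERLINK ...", "PAGEREF ..." and "FORMTEXT" are field instructions sitting between 0x13 and 0x14. The class name WordFieldFilter and the method clean are made up for this example; they are not part of POI.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not a POI class: strips Word field instructions
// from raw extracted text while keeping the visible field results.
class WordFieldFilter {
    // Assumed field marker characters in text extracted from .doc files.
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    static String clean(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        // One entry per open field: true once its separator has been seen,
        // i.e. once we are inside the visible result part.
        Deque<Boolean> openFields = new ArrayDeque<Boolean>();
        // Number of open fields that are still in their instruction part.
        int instructionDepth = 0;

        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.FALSE);
                instructionDepth++;
            } else if (c == FIELD_SEPARATOR) {
                if (!openFields.isEmpty() && !openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.TRUE);
                    instructionDepth--;
                }
            } else if (c == FIELD_END) {
                if (!openFields.isEmpty() && !openFields.pop()) {
                    // Field had no result part; its instruction ends here.
                    instructionDepth--;
                }
            } else if (instructionDepth == 0) {
                // Keep ordinary text plus tabs and line breaks; drop any
                // other control character (these show up as 'boxes').
                boolean keep = c == '\t' || c == '\n' || c == '\r'
                        || !Character.isISOControl(c);
                if (keep) {
                    out.append(c);
                }
            }
        }
        return out.toString();
    }
}
```

With this sketch, each paragraph returned by getParagraphText() could be passed through WordFieldFilter.clean(paragraph) before printing. Nested fields are handled by the stack, and a field with no separator is dropped entirely. Whether this covers every marker in a given document is untested here, so treat it as a starting point rather than a complete filter.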

XenoRo asked Apr 20 '12 17:04



1 Answer

This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:

/*
 * This class is used to read .doc and .docx files
 * 
 * @author Developer
 *
 */

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

class TextExtractor { 
    private OutputStream outputstream;
    private ParseContext context;
    private Detector detector;
    private Parser parser;
    private Metadata metadata;
    private String extractedText;

    public TextExtractor() {
        context = new ParseContext();
        detector = new DefaultDetector();
        parser = new AutoDetectParser(detector);
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
    }

    public void process(String filename) throws Exception {
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            url = file.toURI().toURL();
        } else {
            url = new URL(filename);
        }
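        InputStream input = TikaInputStream.get(url, metadata);
        try {
            // Collect the document body as plain text in outputstream
            ContentHandler handler = new BodyContentHandler(outputstream);
            parser.parse(input, handler, metadata, context);
        } finally {
            // Always release the stream, even if parsing fails
            input.close();
        }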
    }

    public void getString() {
        //Get the text into a String object
        extractedText = outputstream.toString();
        //Do whatever you want with this String object.
        System.out.println(extractedText);
    }

    public static void main(String args[]) throws Exception {
        if (args.length == 1) {
            TextExtractor textExtractor = new TextExtractor();
            textExtractor.process(args[0]);
            textExtractor.getString();
        } else { 
            throw new IllegalArgumentException("Usage: java TextExtractor <file-or-URL>");
        }
    }
}

To compile (on Windows, use ; instead of : as the classpath separator, here and in the run command below):

javac -cp ".:tika-app-1.2.jar" TextExtractor.java

To run:

java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc

Vyas answered Oct 09 '22 06:10