Java: Apache POI: Can I get clean text from MS Word (.doc) files?

The strings I get programmatically from MS Word files with Apache POI are not the same text I see when I open the files in MS Word.

When using the following code:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());

the output is a single line containing many 'invalid' characters (displayed as boxes) and many unwanted strings, such as "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' standing for a digit), "PAGEREF _Toc########## \h 4", etc.

The following code "fixes" the single-line problem, but still keeps all the invalid characters and unwanted text:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
  System.out.println(paragraph);
}

I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?

If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?

asked Apr 20 '12 17:04

XenoRo

1 Answer

This class reads both .doc and .docx files in Java using Apache Tika. For this I'm using tika-app-1.2.jar:

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

/**
 * This class is used to read .doc and .docx files.
 *
 * @author Developer
 */
class TextExtractor {
    private final OutputStream outputstream;
    private final ParseContext context;
    private final Detector detector;
    private final Parser parser;
    private final Metadata metadata;

    public TextExtractor() {
        context = new ParseContext();
        detector = new DefaultDetector();
        // AutoDetectParser picks the right parser (.doc, .docx, ...) from the detected type.
        parser = new AutoDetectParser(detector);
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
    }

    public void process(String filename) throws Exception {
        // Treat the argument as a local file if one exists, otherwise as a URL.
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            url = file.toURI().toURL();
        } else {
            url = new URL(filename);
        }
        // try-with-resources closes the stream even if parsing fails.
        try (InputStream input = TikaInputStream.get(url, metadata)) {
            ContentHandler handler = new BodyContentHandler(outputstream);
            parser.parse(input, handler, metadata, context);
        }
    }

    public String getString() {
        // Return the text collected so far as a String object.
        return outputstream.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            throw new IllegalArgumentException("Usage: TextExtractor <file-or-url>");
        }
        TextExtractor textExtractor = new TextExtractor();
        textExtractor.process(args[0]);
        // Do whatever you want with the extracted text; here it is just printed.
        System.out.println(textExtractor.getString());
    }
}

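To compile (the classpath separator is : on Linux/macOS; on Windows use ; instead, both here and in the run command):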

javac -cp ".:tika-app-1.2.jar" TextExtractor.java

To run:

java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc
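
For anyone who would rather stay with POI's raw text and filter it by hand, as the question asks, here is a minimal sketch. It assumes the unwanted strings ("HYPERLINK ...", "PAGEREF ...", "FORMTEXT") are Word field instructions delimited by the control characters 0x13 (field begin), 0x14 (field separator) and 0x15 (field end), and that the "boxes" are other control characters. WordTextCleaner and clean are hypothetical names, not part of POI. POI's WordExtractor also appears to offer a static stripFields(String) helper for the same purpose, so check the Javadoc of your version before rolling your own.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not part of POI): removes field codes and control characters. */
final class WordTextCleaner {
    // Assumed field markers in the raw text returned by HWPF.
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;
    private static final char LINE_BREAK = 0x0B;

    private WordTextCleaner() {
    }

    static String clean(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        // One flag per open field: true while we are still inside its instruction part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int hiddenDepth = 0; // open fields that are currently in their instruction part

        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                hiddenDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // The instruction ends here; the visible field result follows.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    hiddenDepth--;
                }
            } else if (c == FIELD_END) {
                if (!openFields.isEmpty() && openFields.pop()) {
                    hiddenDepth--; // field had no separator, so no visible result
                }
            } else if (hiddenDepth == 0) {
                if (c == '\r' || c == LINE_BREAK) {
                    out.append('\n'); // paragraph marks and line breaks become newlines
                } else if (c >= 0x20 || c == '\n' || c == '\t') {
                    out.append(c);
                }
                // any other control character (the "boxes") is dropped
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See page " + FIELD_BEGIN + " PAGEREF _Toc123 \\h "
                + FIELD_SEPARATOR + "4" + FIELD_END + "\r";
        System.out.print(clean(raw)); // prints "See page 4" followed by a newline
    }
}
```

With POI you would pass each string from wordExtractor.getParagraphText() (or the whole document text) through WordTextCleaner.clean before printing it. The stack is only there so that nested fields, such as a PAGEREF inside a HYPERLINK in a table of contents, are handled as well.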

answered Oct 09 '22 06:10

Vyas


Tags: java, text, ms-word, extraction, apache-poi
