Java reading .doc file using POI

Tags: java
Hi, I am trying to read text from .doc and .docx files. For .doc files I am doing this:

package test;
import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadFile {
    public static void main(String[] args) {
        File file = null;
        WordExtractor extractor = null;
        try {
            file = new File("C:\\Users\\rijo\\Downloads\\r.doc");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            String fileData = extractor.getText();
            System.out.println(fileData);
        } catch (Exception exep) {
        }
    }
}

But this gives me an org/apache/poi/OldFileFormatException exception.

Any idea how to fix this?

I also need to read .docx and PDF files. Is there a good way to read all of these file types?

Rijo Joseph asked Dec 15 '25 02:12

1 Answer

I used the following jars (in case version numbers play a role here):

dom4j-1.7-20060614
poi-3.9-20121203
poi-ooxml-3.9-20121203
poi-ooxml-schemas-3.9-20121203
poi-scratchpad-3.9-20121203
xmlbeans-2.4.0

I typed this up:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Locale;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class SO {
    public static void main(String[] args) {
        // Alternate between the two to check what works.
        //String filePath = "D:\\Users\\username\\Desktop\\Doc1.docx";
        String filePath = "D:\\Users\\username\\Desktop\\Bob.doc";

        // try-with-resources closes the stream even if parsing fails.
        try (FileInputStream fis = new FileInputStream(filePath)) {
            // Match the full extension, ignoring case, rather than only the last character.
            if (filePath.toLowerCase(Locale.ROOT).endsWith(".docx")) { // is a docx
                XWPFDocument doc = new XWPFDocument(fis);
                XWPFWordExtractor extract = new XWPFWordExtractor(doc);
                System.out.println(extract.getText());
            } else { // is not a docx, so treat it as a binary .doc
                HWPFDocument doc = new HWPFDocument(fis);
                WordExtractor extractor = new WordExtractor(doc);
                System.out.println(extractor.getText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

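This let me read text from both a .docx and a .doc file. If it doesn't work on your PC, you may well have an issue with the external jars you are using. The slash-separated name in your error (org/apache/poi/OldFileFormatException) looks like the message of a NoClassDefFoundError, which usually points to a missing or mismatched jar rather than a bad file.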
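If you suspect the jars, one way to check is to ask the JVM where it actually finds a class. This is only a rough sketch: ClasspathCheck and locate are names I made up for illustration, not anything from POI, and it assumes the check runs with the same classpath as your program.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Apache POI).
final class ClasspathCheck {

    // Returns the location a class is loaded from, or null if it cannot be found.
    static String locate(String className) {
        try {
            // 'false' avoids running static initializers; we only want to find the class.
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK itself have no code source.
            return src == null ? "(JDK runtime)" : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }
}
```

Calling ClasspathCheck.locate("org.apache.poi.OldFileFormatException") and ClasspathCheck.locate("org.apache.poi.hwpf.HWPFDocument") should each give you a jar location. If the first one comes back null, or the two jars carry different version numbers, that would be my first suspect.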

Give it a go though :) Good luck!
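One more thought on telling the file types apart: the file name can lie, so instead of trusting the extension you could peek at the first few bytes. Below is a minimal sketch, assuming the well-known signatures: binary .doc files are OLE2 containers, .docx files are ZIP archives, and PDFs start with "%PDF-". FormatSniffer, Kind and sniff are hypothetical names of mine, not POI classes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Apache POI): guesses a container type from leading bytes.
final class FormatSniffer {

    enum Kind { OLE2, ZIP, PDF, UNKNOWN }

    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Reads up to 8 bytes from the stream and compares them with the known signatures.
    static Kind sniff(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream ended before 8 bytes were available
            }
            n += r;
        }
        if (startsWith(head, n, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(head, n, ZIP_MAGIC)) return Kind.ZIP;
        if (startsWith(head, n, PDF_MAGIC)) return Kind.PDF;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int len, byte[] prefix) {
        if (len < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Keep in mind that this only identifies the container: OLE2 also covers old .xls and .ppt files, and any ZIP archive reports as ZIP. The call also consumes the bytes it reads, so you would reopen the file (or use a stream that supports mark/reset) before handing it to HWPFDocument or XWPFDocument.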

Levenal answered Dec 16 '25 20:12


