Hi i am trying to read text from doc and docx file, for doc files i am doing this
package test;
import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ReadFile {
public static void main(String[] args) {
File file = null;
WordExtractor extractor = null;
try {
file = new File("C:\\Users\\rijo\\Downloads\\r.doc");
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
HWPFDocument document = new HWPFDocument(fis);
extractor = new WordExtractor(document);
String fileData = extractor.getText();
System.out.println(fileData);
} catch (Exception exep) {
}
}
}
But this gives me an org/apache/poi/OldFileFormatException exception.
Any idea how to fix this?
Also I need to read Docx and PDF files ? any good way to read all type of files?
Using the following jars (In case version numbers are playing a role here):
dom4j-1.7-20060614
poi-3.9-20121203
poi-ooxml-3.9-20121203
poi-ooxml-schemas-3.9-20121203
poi-scratchpad-3.9-20121203
xmlbeans-2.4.0
I typed this up:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class SO {
public static void main(String[] args){
//Alternate between the two to check what works.
//String FilePath = "D:\\Users\\username\\Desktop\\Doc1.docx";
String FilePath = "D:\\Users\\username\\Desktop\\Bob.doc";
FileInputStream fis;
if(FilePath.substring(FilePath.length() -1).equals("x")){ //is a docx
try {
fis = new FileInputStream(new File(FilePath));
XWPFDocument doc = new XWPFDocument(fis);
XWPFWordExtractor extract = new XWPFWordExtractor(doc);
System.out.println(extract.getText());
} catch (IOException e) {
e.printStackTrace();
}
} else { //is not a docx
try {
fis = new FileInputStream(new File(FilePath));
HWPFDocument doc = new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(doc);
System.out.println(extractor.getText());
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
this allowed me to read text from both a .docx and .doc respectively. If this doesn't work on your PC you may well have either an issue with the external jars you are using.
Give it a go though :) Good luck!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With