The title may be a little confusing. The simplest approach is to decide by the file extension, like this:
// "is" is the InputStream
if (filePath.endsWith("doc")) {
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
} else if (filePath.endsWith("docx")) {
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
}
This works in most cases. However, I have found that certain files with the extension doc are essentially docx files: if you open one with WinRAR, you will find XML files inside. As is well known, a docx file is a zip archive consisting of XML files.
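To make the difference concrete, here is a minimal sketch (not part of the original question) that classifies a file by its leading bytes instead of its name. It assumes the well-known signatures: OLE2 compound files start with D0 CF 11 E0 A1 B1 1A E1, and zip archives start with "PK" followed by 03 04. The class name WordFormatSniffer and its Kind enum are hypothetical names used only for illustration. Note that a match only says "OLE2 container" or "zip container"; it does not prove the file is a Word document.

```java
class WordFormatSniffer {

    // Hypothetical result type for this sketch.
    enum Kind { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound file (used by binary *.doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a zip local file header (used by *.docx files).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Classifies the container format from the first bytes of a file.
    static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    // Returns true if data begins with all bytes of prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A docx file that was renamed to *.doc would come back as ZIP here, which is exactly the case the extension check gets wrong.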
I believe this problem is not rare, but I have not found any information about it. Obviously, relying on the extension to decide how to read a doc or docx file is not appropriate.
In my case, I have to read a lot of files, including doc or docx files inside compressed archives (zip, 7z, or even rar). Hence, I have to read the content from an InputStream rather than from a File or something else. So "How to know whether a file is .docx or .doc format from Apache POI", with its ZipInputStream approach, is not suitable for my case at all.
What is the best way to tell whether a file is doc or docx? I want a solution that reads the content of a file which may be either, not one that merely decides whether it is doc or docx. Apparently, ZipInputStream is not a good method for my case, and I believe it is not an appropriate method for others either. Why should I have to find out whether the file is doc or docx by catching an exception?
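Using the current stable Apache POI version 3.17, you can use FileMagic. Internally, of course, this also has to look into the file. Note that FileMagic.valueOf(InputStream) needs a stream that supports mark/reset, which is why the example wraps each FileInputStream in a BufferedInputStream.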
Example:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.FileMagic;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadWord {

    // The stream must support mark/reset: FileMagic.valueOf peeks at the
    // first bytes and then resets the stream to its starting position.
    static String read(InputStream is) throws Exception {
        FileMagic magic = FileMagic.valueOf(is); // detect once, reuse below
        System.out.println(magic);
        String text = "";
        if (magic == FileMagic.OLE2) {
            // binary Word file
            WordExtractor ex = new WordExtractor(is);
            text = ex.getText();
            ex.close();
        } else if (magic == FileMagic.OOXML) {
            // zip-based OOXML Word file
            XWPFDocument doc = new XWPFDocument(is);
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            text = extractor.getText();
            extractor.close();
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc")); // really a binary OLE2 Word file
        System.out.println(read(is));
        is.close();

        is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc")); // an OOXML Word file named *.doc
        System.out.println(read(is));
        is.close();

        is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx")); // really an OOXML Word file
        System.out.println(read(is));
        is.close();
    }
}
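Since the question is about entries read through a ZipInputStream, which does not support mark/reset on its own, the following sketch shows the peek-then-reset idea with only the JDK. It is not POI code and not necessarily how FileMagic is implemented. HeaderPeek, markable, and peek are hypothetical names. The idea is to wrap the stream in a BufferedInputStream when needed, look at the first bytes, and reset, so that the extractor still sees the entry from its beginning.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class HeaderPeek {

    // Returns the stream itself if it supports mark/reset,
    // otherwise a BufferedInputStream wrapper that does.
    static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Reads up to n bytes from the current position and then resets the
    // stream, so that later readers start at the same position again.
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);
        byte[] buf = new byte[n];
        int total = 0;
        try {
            // A single read may return fewer bytes than requested, so loop.
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r < 0) {
                    break; // end of stream (or end of the current zip entry)
                }
                total += r;
            }
        } finally {
            in.reset();
        }
        return Arrays.copyOf(buf, total);
    }
}
```

With a ZipInputStream zis positioned on an entry, one would call HeaderPeek.markable(zis) once per entry and pass the returned stream both to the check and to the extractor. Do not close the wrapper between entries, because closing it also closes the underlying zip stream.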