I am trying to read doc and docx files. here is the code:
static String distination="E:\\
static String docFileName="Requirements.docx";
public static void main(String[] args) throws FileNotFoundException, IOException {
// TODO code application logic here
ReadFile rf= new ReadFile();
rf.ReadFileParagraph(distination+docFileName);
}
public void ReadFileParagraph(String path) throws FileNotFoundException, IOException
{
FileInputStream fis;
File file = new File(path);
fis=new FileInputStream(file.getAbsolutePath());
String filename=file.getName();
String fileExtension=fileExtension(path);
if(fileExtension.equals("doc"))
{
HWPFDocument document=new HWPFDocument(fis);
WordExtractor DocExtractor = new WordExtractor(document);
ReadDocFile(DocExtractor,filename);
}
else if(fileExtension.equals("docx"))
{
XWPFDocument documentX = new XWPFDocument(fis);
List<XWPFParagraph> pera =documentX.getParagraphs();
ReadDocXFile(pera,filename);
}
else
{
System.out.println("format does not match");
}
}
public void ReadDocFile(WordExtractor extractor,String filename)
{
for (String paragraph : extractor.getParagraphText()) {
System.out.println("Peragraph: "+paragraph);
}
}
public void ReadDocXFile(List<XWPFParagraph> extractor,String filename)
{
for (XWPFParagraph paragraph : extractor) {
System.out.println("Question: "+paragraph.getParagraphText());
}
}
public String fileExtension(String filename)
{
String extension = filename.substring(filename.lastIndexOf(".") + 1, filename.length());
return extension;
}
this code give an exception when I want to read a docx file:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xmlbeans/XmlException
at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:52)
at autometictagdetection.TagDetection.main(TagDetection.java:36)
Caused by: java.lang.ClassNotFoundException: org.apache.xmlbeans.XmlException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
... 2 more
Java Result: 1
Another problem is when I want to read a Doc file, it read some file very well but for some file it gives an exception like that
Exception in thread "main" org.apache.poi.hwpf.OldWordFileFormatException: The document is too old - Word 95 or older. Try HWPFOldDocument instead?
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:222)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174)
at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:44)
at autometictagdetection.TagDetection.main(TagDetection.java:36)
Java Result: 1
I saw that POI API support word 6 and word 95 in http://poi.apache.org/hwpf/index.html. Please anybody can give a solution of this two problems?
If you are working with tools where you have to open the document by clicking on it, you can use the java. awt. Desktop API to easily open the document by passing the file object.
In java programming language we normally use the POI Library to read the word document file. For doing this we will make class HWPFDocument which throw all of the Word file data and the class WordExtractor extract the text from the Word Document.
core maven dependencies required this is the solution to Problem Number 1
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>3.15</version>
</dependency>
<!-- For .DOCX FILES -->
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>3.15</version>
</dependency>
<!-- For .DOC FILES -->
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-scratchpad</artifactId>
<version>3.9</version>
</dependency>
For Problem 2 From the original source code , seems POI doesn't support documents way too old
/**
* This constructor loads a Word document from a specific point
* in a POIFSFileSystem, probably not the default.
* Used typically to open embeded documents.
*
* @param directory The DirectoryNode that contains the Word document.
* @throws IOException If there is an unexpected IOException from the passed
* in POIFSFileSystem.
*/
public HWPFDocument(DirectoryNode directory) throws IOException
{
// Load the main stream and FIB
// Also handles HPSF bits
super(directory);
// Is this document too old for us?
if(_fib.getFibBase().getNFib() < 106) {
throw new OldWordFileFormatException("The document is too old - Word 95 or older. Try HWPFOldDocument instead?");
}
Source code for HWPFDocument
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With