
How to read doc and docx files in Java with the POI API

I am trying to read doc and docx files. Here is the code:

  static String distination = "E:\\";
  static String docFileName = "Requirements.docx";
  public static void main(String[] args) throws FileNotFoundException, IOException {
    // TODO code application logic here
    ReadFile rf = new ReadFile();
    rf.ReadFileParagraph(distination + docFileName);
  }
  public void ReadFileParagraph(String path) throws FileNotFoundException, IOException
    {
        FileInputStream fis;
        File file = new File(path);
        fis=new FileInputStream(file.getAbsolutePath());
           String filename=file.getName();

        String fileExtension=fileExtension(path);
        if(fileExtension.equals("doc"))
        {
             HWPFDocument document=new HWPFDocument(fis);
             WordExtractor DocExtractor = new WordExtractor(document);
             ReadDocFile(DocExtractor,filename);

        }
        else if(fileExtension.equals("docx"))
        {

            XWPFDocument documentX = new XWPFDocument(fis);            
            List<XWPFParagraph> pera =documentX.getParagraphs();
            ReadDocXFile(pera,filename);
        }
        else
        {
            System.out.println("format does not match");
        }

    }
    public void ReadDocFile(WordExtractor extractor,String filename)
    {

        for (String paragraph : extractor.getParagraphText()) {
            System.out.println("Paragraph: "+paragraph);
        }
    }
    public void ReadDocXFile(List<XWPFParagraph> extractor,String filename)
    {

        for (XWPFParagraph paragraph : extractor) {
          System.out.println("Question: "+paragraph.getParagraphText());
        }

    }
    public String fileExtension(String filename)
    {

       String extension = filename.substring(filename.lastIndexOf(".") + 1, filename.length());
       return extension;
    }

This code throws an exception when I try to read a docx file:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xmlbeans/XmlException
    at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:52)
    at autometictagdetection.TagDetection.main(TagDetection.java:36)
Caused by: java.lang.ClassNotFoundException: org.apache.xmlbeans.XmlException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 2 more
Java Result: 1

Another problem occurs when I read doc files: some files are read fine, but for others I get an exception like this:

    Exception in thread "main" org.apache.poi.hwpf.OldWordFileFormatException: The document is too old - Word 95 or older. Try HWPFOldDocument instead?
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:222)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174)
    at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:44)
    at autometictagdetection.TagDetection.main(TagDetection.java:36)
Java Result: 1

I saw at http://poi.apache.org/hwpf/index.html that the POI API supports Word 6 and Word 95. Can anybody suggest a solution to these two problems?

Khaled asked Jul 12 '13 14:07





1 Answer

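Problem 1 is solved by adding the core Maven dependencies below. poi-ooxml pulls in xmlbeans transitively, which provides the missing org.apache.xmlbeans.XmlException class.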

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.15</version>
</dependency>
<!-- For .docx files -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<!-- For .doc files; all POI artifacts must share the same version -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.15</version>
</dependency>
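
To confirm that the classes named in the stack trace are actually visible at runtime, a small check like the following may help. It is only a sketch: the class name ClasspathCheck and the helper isClassAvailable are made up for illustration, and the list of class names is taken from the stack trace and the question's code.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace and the question's code
        String[] required = {
            "org.apache.xmlbeans.XmlException",
            "org.apache.poi.xwpf.usermodel.XWPFDocument",
            "org.apache.poi.hwpf.HWPFDocument"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isClassAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath as the failing program. Any line reporting MISSING points to a jar that is not being picked up.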

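For Problem 2: the original source code shows that HWPFDocument rejects documents from Word 95 or older, and the exception message itself suggests trying HWPFOldDocument instead. The relevant constructor: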

  /**
   * This constructor loads a Word document from a specific point
   *  in a POIFSFileSystem, probably not the default.
   * Used typically to open embeded documents.
   *
   * @param directory The DirectoryNode that contains the Word document.
   * @throws IOException If there is an unexpected IOException from the passed
   *         in POIFSFileSystem.
   */
  public HWPFDocument(DirectoryNode directory) throws IOException
  {
    // Load the main stream and FIB
    // Also handles HPSF bits
    super(directory);

    // Is this document too old for us?
    if(_fib.getFibBase().getNFib() < 106) {
        throw new OldWordFileFormatException("The document is too old - Word 95 or older. Try HWPFOldDocument instead?");
    }

Source code for HWPFDocument
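
Following the hint in the exception message, one possible fallback is to try HWPFDocument first and switch to HWPFOldDocument when OldWordFileFormatException is thrown. This is a rough sketch, assuming the POI 3.15 API with poi-scratchpad on the classpath; the class name OldDocFallback and the method extractText are hypothetical, and Word6Extractor is the scratchpad extractor for these old documents.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.HWPFOldDocument;
import org.apache.poi.hwpf.OldWordFileFormatException;
import org.apache.poi.hwpf.extractor.Word6Extractor;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class OldDocFallback {

    /** Extracts the text of a .doc file, falling back to the Word 6 / Word 95 reader. */
    public static String extractText(File file) throws IOException {
        POIFSFileSystem fs;
        try (InputStream in = new FileInputStream(file)) {
            // Reads the whole OLE2 container into memory, so it can be reused below
            fs = new POIFSFileSystem(in);
        }
        try {
            // Normal path: Word 97 and newer
            return new WordExtractor(new HWPFDocument(fs)).getText();
        } catch (OldWordFileFormatException e) {
            // Fallback path: Word 95 or older, as the exception message suggests
            return new Word6Extractor(new HWPFOldDocument(fs)).getText();
        }
    }
}
```

Loading the file into a POIFSFileSystem once means the input stream does not have to be reopened for the second attempt. Text extraction from these old formats may be less complete than for newer documents.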

Yugansh answered Oct 19 '22 02:10
