
How to read .doc and .docx files in Java with the Apache POI API

Tags:

java

apache-poi

docx

doc

I am trying to read .doc and .docx files. Here is the code:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.List;

    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;

    public class ReadFile {

        static String destination = "E:\\";
        static String docFileName = "Requirements.docx";

        public static void main(String[] args) throws IOException {
            ReadFile rf = new ReadFile();
            rf.ReadFileParagraph(destination + docFileName);
        }

        // Opens the file and dispatches on its extension: HWPF for .doc, XWPF for .docx.
        public void ReadFileParagraph(String path) throws IOException {
            File file = new File(path);
            String filename = file.getName();
            String fileExtension = fileExtension(path);

            // try-with-resources closes the stream even if POI throws.
            try (FileInputStream fis = new FileInputStream(file)) {
                if (fileExtension.equals("doc")) {
                    HWPFDocument document = new HWPFDocument(fis);
                    WordExtractor docExtractor = new WordExtractor(document);
                    ReadDocFile(docExtractor, filename);
                } else if (fileExtension.equals("docx")) {
                    XWPFDocument documentX = new XWPFDocument(fis);
                    List<XWPFParagraph> paragraphs = documentX.getParagraphs();
                    ReadDocXFile(paragraphs, filename);
                } else {
                    System.out.println("format does not match");
                }
            }
        }

        // Prints every paragraph of a binary .doc file.
        public void ReadDocFile(WordExtractor extractor, String filename) {
            for (String paragraph : extractor.getParagraphText()) {
                System.out.println("Paragraph: " + paragraph);
            }
        }

        // Prints every paragraph of a .docx file.
        public void ReadDocXFile(List<XWPFParagraph> paragraphs, String filename) {
            for (XWPFParagraph paragraph : paragraphs) {
                System.out.println("Question: " + paragraph.getParagraphText());
            }
        }

        // Returns the text after the last dot in the file name.
        public String fileExtension(String filename) {
            return filename.substring(filename.lastIndexOf(".") + 1);
        }
    }

This code throws an exception when I try to read a .docx file:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xmlbeans/XmlException
    at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:52)
    at autometictagdetection.TagDetection.main(TagDetection.java:36)
Caused by: java.lang.ClassNotFoundException: org.apache.xmlbeans.XmlException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 2 more
Java Result: 1

The other problem is with .doc files: some are read without trouble, but others throw an exception like this:

Exception in thread "main" org.apache.poi.hwpf.OldWordFileFormatException: The document is too old - Word 95 or older. Try HWPFOldDocument instead?
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:222)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174)
    at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:44)
    at autometictagdetection.TagDetection.main(TagDetection.java:36)
Java Result: 1

The POI documentation at http://poi.apache.org/hwpf/index.html says that POI supports Word 6 and Word 95. Can anybody suggest a solution to these two problems?


asked Jul 12 '13 14:07

Khaled

1 Answer

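Problem 1: the NoClassDefFoundError means the XMLBeans jar is missing from the classpath. Declaring the core Maven dependencies below solves it, because poi-ooxml brings in XMLBeans transitively: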

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.15</version>
        </dependency>
        <!-- For .DOCX FILES -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.15</version>
        </dependency>
        <!-- For .DOC FILES (use the same version as the other POI artifacts) -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>3.15</version>
        </dependency>
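
To confirm that the dependency change actually put XMLBeans on the runtime classpath, a quick check along these lines may help. This is only a minimal sketch: the class name ClasspathCheck and the method isOnClasspath are hypothetical helpers, not part of POI, and the only fact taken from the question is the missing class name org.apache.xmlbeans.XmlException.

```java
// Hypothetical diagnostic helper (not part of POI): reports whether a class
// can be loaded from the current classpath without initializing it.
final class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // 'false' = do not run static initializers, we only want to know if it resolves
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class named in the stack trace from the question
        String missing = "org.apache.xmlbeans.XmlException";
        System.out.println(missing + " on classpath: " + isOnClasspath(missing));
    }
}
```

If this prints false when run with the same classpath as the application, the XMLBeans jar is still not being picked up, for example because the IDE project has not been refreshed after editing the pom.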

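Problem 2: judging from the original source code, HWPFDocument rejects documents from Word 95 or older (those whose FIB version number, nFib, is below 106). As the exception message itself suggests, such files have to be opened with HWPFOldDocument instead: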

  /**
   * This constructor loads a Word document from a specific point
   *  in a POIFSFileSystem, probably not the default.
   * Used typically to open embeded documents.
   *
   * @param directory The DirectoryNode that contains the Word document.
   * @throws IOException If there is an unexpected IOException from the passed
   *         in POIFSFileSystem.
   */
  public HWPFDocument(DirectoryNode directory) throws IOException
  {
    // Load the main stream and FIB
    // Also handles HPSF bits
    super(directory);

    // Is this document too old for us?
    if(_fib.getFibBase().getNFib() < 106) {
        throw new OldWordFileFormatException("The document is too old - Word 95 or older. Try HWPFOldDocument instead?");
    }

Source code for HWPFDocument
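
Following the hint in the exception message, a fallback could look roughly like the sketch below. It assumes poi and poi-scratchpad 3.15 on the classpath. The class name DocTextReader and the method readDocText are hypothetical. Word6Extractor is the scratchpad extractor built on HWPFOldDocument, and its support for Word 6/95 files is limited, so the extracted text should be treated as best effort.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.OldWordFileFormatException;
import org.apache.poi.hwpf.extractor.Word6Extractor;
import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical helper: reads a .doc file, falling back to the old-format
// extractor when HWPFDocument reports a Word 95 or older file.
final class DocTextReader {

    static String readDocText(String path) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            // Word 97 and later: the normal HWPF path
            WordExtractor extractor = new WordExtractor(new HWPFDocument(in));
            return extractor.getText();
        } catch (OldWordFileFormatException e) {
            // The first stream is already closed here; reopen the file
            // because the failed attempt consumed it.
            try (InputStream in = new FileInputStream(path)) {
                Word6Extractor oldExtractor = new Word6Extractor(in);
                return oldExtractor.getText();
            }
        }
    }
}
```

In the question's code, the "doc" branch of ReadFileParagraph could call a helper like this instead of constructing HWPFDocument directly.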


answered Oct 19 '22 02:10

Yugansh
