
Apache POI: Extract a paragraph and the table that follows it from a Word document (docx) in Java

I have a bunch of Word documents (docx) that give each test case name as a paragraph title and the test steps in the table that follows, along with some other information.

I need to extract the test case name (from the paragraph) and the test steps (from the table) from each document using Apache POI.

An example of the document contents:

Section 1: Index
Section 2: Some description
    A. Paragraph 1
    B. Table 1
    C. Paragraph 2
    D. Paragraph 3
    E. Table 2
Section 3: test cases ( The title "test cases" is constant, so I can look for it in the doc)
    A. Paragraph 4 (First test case)
    B. Table 3 (Test steps table immediately after the para 4)
    C. Paragraph 5 (Second test case)
    D. Table 4 (Test steps table immediately after the para 5)

Apache POI provides APIs that return the list of paragraphs and the list of tables, but I am not able to read a paragraph (test case) and then immediately look for the table that follows it.

I tried XWPFWordExtractor (to read all the text) and bodyElementIterator (to iterate over all the body elements), but most of what I found offers a getParagraphText() method that gives a list of paragraphs [para1, para2, para3, para4, para5] and a getTables() method that gives all the tables in the document as a list [table1, table2, table3, table4].

How do I go over all the paragraphs, stop at the paragraph that comes after the heading 'test cases' (paragraph 4), and then look for the table immediately after it (table 3)? I then need to repeat this for paragraph 5 and table 4.

Here is the gist link (code) I tried. It gives a list of paragraphs and a list of tables, but not in a sequence that I can track.

Any help is much appreciated.

Asked by Sauchin, Jun 02 '16 17:06


1 Answer

The Word API in POI is still in flux, and buggy, but you should be able to iterate over the paragraphs in one of two ways:

// fis is an open InputStream on the .docx file
XWPFDocument doc = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = doc.getParagraphs();
for (XWPFParagraph p : paragraphs) {
   // ... do something here, e.g. read p.getText()
}

or

// fis is an open InputStream on the .docx file
XWPFDocument doc = new XWPFDocument(fis);
Iterator<XWPFParagraph> iter = doc.getParagraphsIterator();
while (iter.hasNext()) {
   XWPFParagraph p = iter.next();
   // ... do something here
}

The Javadocs say that XWPFDocument.getParagraphs() retrieves the paragraphs that hold the text in the header or footer, but I have to believe this is a cut-and-paste error, since XWPFHeaderFooter.getParagraphs() says the same thing. Looking at the source, XWPFDocument.getParagraphs() returns an unmodifiable list, while using the iterator leaves the paragraphs modifiable. This is likely to change in the future, but it is the way it works for now.

To retrieve all body elements, both paragraphs and tables, in document order, you need to use:

// fis is an open InputStream on the .docx file
XWPFDocument doc = new XWPFDocument(fis);
Iterator<IBodyElement> iter = doc.getBodyElementsIterator();
while (iter.hasNext()) {
   IBodyElement elem = iter.next();
   if (elem instanceof XWPFParagraph) {
      XWPFParagraph p = (XWPFParagraph) elem;
      // ... do something here
   } else if (elem instanceof XWPFTable) {
      XWPFTable t = (XWPFTable) elem;
      // ... do something here
   }
}

This should allow you to loop through all body elements in order.
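To tie this back to the question, here is a minimal sketch of how the in-order walk could pair each test case paragraph with the table right after it. The class name TestCaseExtractor and the extract method are made up for illustration. It assumes the heading paragraph's text is exactly "test cases" (ignoring case and surrounding whitespace), that the test case name is the last non-empty paragraph before each table, and that test case names are unique. Adjust the matching if your heading carries a prefix such as "Section 3:".

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

// Hypothetical helper, not part of POI.
public class TestCaseExtractor {

    /**
     * Maps each test case name to the rows (lists of cell texts) of the table
     * that follows it. Only elements after the heading paragraph are considered.
     */
    public static Map<String, List<List<String>>> extract(XWPFDocument doc, String headingText) {
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        boolean inSection = false;   // true once the heading has been seen
        String pendingName = null;   // last non-empty paragraph seen in the section

        for (IBodyElement elem : doc.getBodyElements()) {
            if (elem instanceof XWPFParagraph) {
                String text = ((XWPFParagraph) elem).getText().trim();
                if (text.isEmpty()) {
                    continue;        // skip spacer paragraphs
                }
                if (!inSection) {
                    inSection = text.equalsIgnoreCase(headingText);
                } else {
                    pendingName = text;
                }
            } else if (elem instanceof XWPFTable && inSection && pendingName != null) {
                List<List<String>> rows = new ArrayList<>();
                for (XWPFTableRow row : ((XWPFTable) elem).getRows()) {
                    List<String> cells = new ArrayList<>();
                    for (XWPFTableCell cell : row.getTableCells()) {
                        cells.add(cell.getText());
                    }
                    rows.add(cells);
                }
                result.put(pendingName, rows);
                pendingName = null;  // a table is paired with at most one paragraph
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .docx file
        try (InputStream in = new FileInputStream(args[0]);
             XWPFDocument doc = new XWPFDocument(in)) {
            extract(doc, "test cases").forEach((name, rows) ->
                    System.out.println(name + " -> " + rows));
        }
    }
}
```

Tables that appear before the heading (Table 1 and Table 2 in the example layout) are skipped because inSection is still false at that point. If the index section repeats the heading text as an ordinary paragraph, the match would fire too early, so in that case you may want to check the paragraph style as well.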

Answered by jmarkmurphy, Oct 06 '22 01:10