
Extract Paragraph from Word Document Using Apache POI

Tags:

java

apache

I have a Word document (a .docx file).

As you can see in the document, there are a number of questions with bullet points. I am trying to extract each paragraph from the file using Apache POI. Here is my current code:

    import java.io.FileInputStream;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;

    public static String readDocxFile(String fileName) {
        // try-with-resources closes the document and the stream even if reading fails
        try (FileInputStream fis = new FileInputStream(fileName);
             XWPFDocument document = new XWPFDocument(fis)) {
            StringBuilder whole = new StringBuilder();
            for (XWPFParagraph para : document.getParagraphs()) {
                System.out.println(para.getText());
                whole.append("\n").append(para.getText());
            }
            return whole.toString();
        } catch (Exception e) {
            e.printStackTrace();
            return "";
        }
    }

The problem with the method above is that it prints each line rather than each paragraph. The bullet points are also missing from the extracted string; whole is returned as plain text.

Can anyone explain what I am doing wrong? Please also suggest a better approach if you have one.

Mars Moon asked Feb 01 '18 07:02



1 Answer

Your code is correct. I ran it on my system and it returned every paragraph. I think the problem lies in how the content was written in the .docx file: whenever I type content in bullet points and press the Enter key, that breaks the current bullet point, and the code above treats the broken-off line as a separate paragraph.
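
As for the missing bullets: Word does not store the bullet character in the paragraph text. It stores a numbering reference on the paragraph, so getText() returns the text without it. Below is a minimal sketch that marks list items, assuming your POI version exposes XWPFParagraph.getNumID() (I have not checked this against every release). The class name BulletAwareExtractor and the "- " prefix are my own choices for illustration, not part of POI.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

// Hypothetical helper (not part of POI): marks list paragraphs with a "- " prefix.
public class BulletAwareExtractor {

    // Returns one string per non-empty paragraph; list items get a "- " prefix.
    public static List<String> extract(XWPFDocument document) {
        List<String> lines = new ArrayList<>();
        for (XWPFParagraph para : document.getParagraphs()) {
            String text = para.getText();
            if (text == null || text.isEmpty()) {
                continue; // skip blank paragraphs
            }
            // getNumID() is non-null when the paragraph itself references a
            // numbering definition, i.e. it is a bulleted or numbered list item.
            boolean isListItem = para.getNumID() != null;
            lines.add(isListItem ? "- " + text : text);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path to your .docx file as the first argument.
        try (FileInputStream fis = new FileInputStream(args[0])) {
            XWPFDocument document = new XWPFDocument(fis);
            extract(document).forEach(System.out::println);
        }
    }
}
```

Note that a list applied only through a paragraph style may not be detected this way, because the numbering reference then lives in the style rather than on the paragraph.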

Here is a code sample that may be useful. It uses a Set to ignore duplicate questions in the .docx file.

The Apache POI dependency is:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.7</version>
</dependency>

Code sample:

package com;

import java.io.File;
import java.io.FileInputStream;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class App {

    public static void main(String... args) throws Exception {
        Set<String> bulletPoints = fileExtractor();
        bulletPoints.forEach(System.out::println);
    }

    public static Set<String> fileExtractor() throws Exception {
        File file = new File("/home/deskuser/Documents/query.docx");
        // try-with-resources closes the stream exactly once, even on failure
        try (FileInputStream fis = new FileInputStream(file)) {
            // A Set drops duplicate questions; HashSet does not keep document order
            Set<String> bulletPoints = new HashSet<>();
            XWPFDocument document = new XWPFDocument(fis);

            List<XWPFParagraph> paragraphs = document.getParagraphs();
            for (XWPFParagraph para : paragraphs) {
                String text = para.getText();
                System.out.println(text);
                // Skip empty paragraphs (plain null/empty check, no Spring needed)
                if (text != null && !text.isEmpty()) {
                    bulletPoints.add(text);
                }
            }
            return bulletPoints;
        } catch (Exception e) {
            e.printStackTrace();
            throw new Exception("error while extracting file.", e);
        }
    }
}
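
One note on the Set: a HashSet does not guarantee iteration order, so the questions may print in a different order than they appear in the document. If order matters, a LinkedHashSet is a drop-in alternative. Here is a minimal sketch using plain strings in place of paragraph text; the names QuestionDeduplicator and dedupe are hypothetical and only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: removes duplicate and empty entries, keeping first-seen order.
public class QuestionDeduplicator {

    public static Set<String> dedupe(List<String> paragraphTexts) {
        // LinkedHashSet iterates in insertion order, unlike HashSet
        Set<String> unique = new LinkedHashSet<>();
        for (String text : paragraphTexts) {
            if (text != null && !text.isEmpty()) {
                unique.add(text);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("Q1", "Q2", "", "Q1", "Q3");
        System.out.println(dedupe(texts)); // prints [Q1, Q2, Q3]
    }
}
```

In the sample above this would mean collecting para.getText() into a LinkedHashSet instead of a HashSet; nothing else needs to change.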
ritesh9984 answered Sep 30 '22 13:09
