Apache POI - Retrieve text content between keywords in .doc file and conditionally render it

Question

I would like to find text content between two keywords in .doc files, and conditionally render that text content or hide it. For example:

Lorem Ipsum is simply dummy text ${if condition} of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s ${endif}

When I parse the document using the Apache - POI, I would like to be able in some way to spot in the document each and every content between these blockquotes ${if condition} ${endif} and conditionally render it or not in the next document I want to produce.

So the above text after my parsing should have the following two different forms:

1) In case the condition is satisfied

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s

or

2) In case the condition is not satisfied

Lorem Ipsum is simply dummy text

I have tried to do this by using the XWPFParagraph object and then XWPFRun but that is no way reliable way as a run can be randomly split in the middle of a word under unpredictable conditions.

Could you please propose any reliable way to achieve my use case? Thanks in advance.

Newbie · Accepted Answer

Take this as an example (code is tested):

class ParagraphModifier {

    private final Pattern pIf = Pattern.compile("\$\{if\s+(\w+)\}");
    private final Pattern pEIf = Pattern.compile("\$\{endif\}");
    private final Function<String, Boolean> processor;

    public ParagraphModifier(Function<String, Boolean> processor) {
        this.processor = processor;
    }

    // Process

    static class Pair<K, V> {
        public K key;
        public V value;
        public Pair(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    // https://stackoverflow.com/questions/23112924
    public static void cloneRun(XWPFRun clone, XWPFRun source) {
        CTRPr rPr = clone.getCTR().isSetRPr() ? clone.getCTR().getRPr() : clone.getCTR().addNewRPr();
        rPr.set(source.getCTR().getRPr());
        clone.setText(source.getText(0));
    }

    // Split runs in paragraph at a specific text offset and returns the run index
    int splitAtTextPosition(XWPFParagraph paragraph, int position) {
        List<XWPFRun> runs = paragraph.getRuns();
        int offset = 0;

        for (int i = 0; i < runs.size(); i++) {
            XWPFRun run = runs.get(i);
            String text = run.getText(0);
            int length = text.length();

            if (position >= (offset + length)) {
                offset += length;
                continue;
            }

            // Do split
            XWPFRun run2 = paragraph.insertNewRun(i + 1);
            cloneRun(run2, run);
            run.setText(text.substring(0, position - offset), 0);
            run2.setText(text.substring(position - offset), 0);
            return i + 1;
        }
        return -1;
    }

    String getParagraphText(XWPFParagraph paragraph) {
        StringBuilder sb = new StringBuilder("");
        for (XWPFRun run : paragraph.getRuns()) sb.append(run.getText(0));
        return sb.toString();
    }

    void removeRunsRange(XWPFParagraph paragraph, int from, int to) {
        int runs = paragraph.getRuns().size();
        to = Math.min(to, runs);
        for (int i = (to - 1); i >= from; i--) {
            paragraph.removeRun(i);
        }
    }

    Pair<Integer, String> extractToken(Pattern pattern, XWPFParagraph paragraph) {
        String text = getParagraphText(paragraph);
        Matcher matcher = pattern.matcher(text);

        if (matcher.find()) {
            int rStart = splitAtTextPosition(paragraph, matcher.start());
            int rEnd = splitAtTextPosition(paragraph, matcher.end());
            removeRunsRange(paragraph, rStart, rEnd);
            return new Pair<>(rStart, matcher.group());
        } else {
            return new Pair<>(-1, "");
        }
    }

    void applyParagraph(XWPFParagraph paragraph) {
        int lastIf = -1;

        while (true) {
            var tIf = extractToken(pIf, paragraph);
            if (tIf.key == -1) {
                break;
            }
            if (tIf.key < lastIf) {
                throw new IllegalStateException("If conditions can not be nested");
            }

            var tEIf = extractToken(pEIf, paragraph);
            if (tEIf.key == -1) {
                throw new IllegalStateException("If condition missing endif");
            }

            var m = pIf.matcher(tIf.value);
            var keep = m.find() && processor.apply(m.group(1));
            if (!keep) {
                removeRunsRange(paragraph, tIf.key, tEIf.key);
            }

            lastIf = tEIf.key;
        }
    }

    void apply(Iterable<XWPFParagraph> paragraphs) {
        for (XWPFParagraph p : paragraphs) {
            applyParagraph(p);
        }
    }

}

Usage:

class Main {

    private static XWPFDocument loadDoc(String name) throws IOException, InvalidFormatException {
        String path = Main.class.getClassLoader().getResource(name).getPath();
        FileInputStream fis = new FileInputStream( path);
        return new XWPFDocument(OPCPackage.open(fis));
    }

    private static void saveDoc(String path, XWPFDocument doc) throws IOException {
        try (var fos = new FileOutputStream(path)) {
            doc.write(fos);
        }
    }

    public static void main (String[] args) throws Exception {
        var xdoc = loadDoc("test.docx");

        var pm = new ParagraphModifier(str -> str.toLowerCase().equals("true"));
        pm.apply(xdoc.getParagraphs());

        saveDoc("test.out.docx", xdoc);
    }
    
}

This solution does not support ${if } blocks spanning over paragraphs, if nesting, nor Table structures. Expanding the solution to support them should be straightforward.

Apache POI - Retrieve text content between keywords in .doc file and conditionally render it

Tags:

java

ms-word

apache-poi

xwpf

apache-poi-4

NickAth

1 Answers

Newbie

Recent Activity

Donate For Us

Apache POI - Retrieve text content between keywords in .doc file and conditionally render it

Tags:

java

ms-word

apache-poi

xwpf

apache-poi-4

NickAth

1 Answers

Newbie

Related questions

Recent Activity

Donate For Us