
Get the list of objects containing text matching a pattern

I'm currently working with the Apache POI API and I'm trying to edit a Word document (*.docx) with it. A document is composed of paragraphs (XWPFParagraph objects), and a paragraph contains text embedded in 'runs' (XWPFRun). A paragraph can have many runs (depending on the text properties, but the split sometimes seems random). My document can contain specific tags that I need to replace with data (all my tags follow this pattern: <#TAG_NAME#>).

So for example, if I process a paragraph containing the text "Some text with a tag <#SOMETAG#>", I could get something like this:

XWPFParagraph paragraph = ... // Get a paragraph from the document
System.out.println(paragraph.getText());
// Prints: Some text with a tag <#SOMETAG#>

But if I want to edit the text of that paragraph, I need to process its runs, and the number of runs is not fixed. If I print the content of the runs with this code:

System.out.println("Number of runs: " + paragraph.getRuns().size());
for (XWPFRun run : paragraph.getRuns()) {
    System.out.println(run.text());
}

Sometimes it can be like this:

// Output:
// Number of runs: 1
// Some text with a tag <#SOMETAG#>

And other times like this:

// Output:
// Number of runs: 4
// Some text with a tag 
// <#
// SOMETAG
// #>

What I need is to get the first run containing the start of the tag and the indexes of the following runs containing the rest of the tag (if the tag is split across several runs). I've managed to write a first version of that algorithm, but it only works if the beginning of the tag (<#) and the end of the tag (#>) aren't themselves split across runs. Here's what I've already done.

So what I would like is an algorithm able to handle that problem and, if possible, to work with any given tag delimiters (not necessarily <# and #>, so I could replace them with something like {{{ and }}}).
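
As a rough sketch of one way to frame the lookup (not the POI API itself): concatenate the run texts, find the tag with a literal indexOf, and map the character offsets back to run indexes. Each run is modelled here as a plain String, and the names TagLocator and locate are hypothetical; with POI, the list could presumably be built from run.text() for each run in paragraph.getRuns().

```java
import java.util.List;

/** Sketch only: finds which runs a tag spans, with each run modelled as its text. */
final class TagLocator {

    /**
     * Returns {firstRun, lastRun} (inclusive indexes) of the runs covering the
     * first occurrence of tag, or null if the tag is empty or absent.
     */
    static int[] locate(List<String> runTexts, String tag) {
        String joined = String.join("", runTexts);
        int start = joined.indexOf(tag);          // literal search, no regex involved
        if (tag.isEmpty() || start < 0) {
            return null;
        }
        int end = start + tag.length() - 1;       // offset of the tag's last character
        int first = -1;
        int last = -1;
        int offset = 0;                           // offset of the current run in joined
        for (int i = 0; i < runTexts.size(); i++) {
            int runEnd = offset + runTexts.get(i).length(); // exclusive end offset
            if (first < 0 && start < runEnd) {
                first = i;                        // run holding the tag's first character
            }
            if (end < runEnd) {
                last = i;                         // run holding the tag's last character
                break;
            }
            offset = runEnd;
        }
        return new int[] {first, last};
    }

    public static void main(String[] args) {
        List<String> runs = List.of("Some text with a tag ", "<#", "SOMETAG", "#>");
        int[] span = locate(runs, "<#SOMETAG#>");
        System.out.println(span[0] + ".." + span[1]); // prints 1..3
    }
}
```

Because the search is literal, the same method works unchanged for other delimiters such as {{{ and }}}, and it does not matter where inside the tag the run boundaries fall.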

Sorry if my English isn't perfect, don't hesitate to ask me to clarify any point you want.

Florentin Le Moal asked Jun 10 '15 09:06


1 Answer

In the end I found the answer myself by completely rethinking my original algorithm (I commented the code so it might help someone in the same situation):

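import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Precondition, checked by the caller before using this method:
// paragraph.getText().contains(surroundedTag) == true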
private void editParagraphWithData(XWPFParagraph paragraph, String surroundedTag, String replacement) {
    List<Integer> runsToRemove = new LinkedList<Integer>();
    StringBuilder tmpText = new StringBuilder();
    int runCursor = 0;

    // Walk the runs in order until the accumulated text contains surroundedTag
    while (!tmpText.toString().contains(surroundedTag)) {
        tmpText.append(paragraph.getRuns().get(runCursor).text());
        runsToRemove.add(runCursor);
        runCursor++;
    }

    tmpText = new StringBuilder();
    // Walk back (in reverse order) to keep only the runs that need to be edited/removed
    while (!tmpText.toString().contains(surroundedTag)) {
        runCursor--;
        tmpText.insert(0, paragraph.getRuns().get(runCursor).text());
    }

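    // Edit the first run of the tag.
    // String.replace is literal; replaceAll would treat the tag as a regex
    // and fail on delimiters such as {{{ and }}}
    XWPFRun runToEdit = paragraph.getRuns().get(runCursor);
    runToEdit.setText(tmpText.toString().replace(surroundedTag, replacement), 0);

    // Forget the runs I don't need to remove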
    while (runCursor >= 0) {
        runsToRemove.remove(0);
        runCursor--;
    }

    // Remove the unused runs
    Collections.reverse(runsToRemove);
    for (Integer runToRemove : runsToRemove) {
        paragraph.removeRun(runToRemove);
    }
}

So now I process the runs of the paragraph in order until I find my surrounded tag, then I walk back through them to skip the leading runs that I don't need to edit.
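
The same forward/backward pass can be tried without a .docx file by modelling each run as a plain String. This is only a sketch under that assumption, not POI code: RunTagReplacer and replaceTag are hypothetical names, and it uses the literal String.replace so that delimiters such as {{{ and }}}, or a replacement containing $, are not interpreted as regex syntax.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: the forward/backward pass, with each run modelled as its text. */
final class RunTagReplacer {

    /**
     * Returns a new list in which the runs spanned by the tag are merged into
     * one entry, with the tag replaced inside that merged text.
     */
    static List<String> replaceTag(List<String> runTexts, String tag, String replacement) {
        List<String> result = new ArrayList<>(runTexts);
        if (tag.isEmpty() || !String.join("", result).contains(tag)) {
            return result; // nothing to replace, return an unchanged copy
        }

        // Forward pass: append runs until the accumulated text contains the tag.
        StringBuilder text = new StringBuilder();
        int end = 0; // exclusive index of the last run involved
        while (text.indexOf(tag) < 0) {
            text.append(result.get(end));
            end++;
        }

        // Backward pass: prepend runs until the tag is complete again,
        // which skips leading runs that do not take part in the tag.
        text.setLength(0);
        int first = end;
        while (text.indexOf(tag) < 0) {
            first--;
            text.insert(0, result.get(first));
        }

        // Literal replacement, then swap runs first..end-1 for the merged text.
        String merged = text.toString().replace(tag, replacement);
        result.subList(first, end).clear();
        result.add(first, merged);
        return result;
    }

    public static void main(String[] args) {
        List<String> runs = List.of("Some text with a tag ", "<#", "SOMETAG", "#>");
        System.out.println(replaceTag(runs, "<#SOMETAG#>", "value"));
        // prints [Some text with a tag , value]
    }
}
```

With a regex-based replaceAll, a tag like {{{NAME}}} is rejected as an invalid pattern and a replacement like $100 is read as a group reference, which is why the literal replace is used here.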

Florentin Le Moal answered Sep 22 '22 05:09