Regex to match specific file format and empty strings

Tags:

java

regex

I am trying to use regex to match a file in the following format:

FILTER
<data>
ORDER
<data>

Now, the <data> part is the one that I need to extract, and that would be really simple, except I have the following complications:

1) This pattern can be repeated (with no line breaks in between)

2) The <data> parts may be missing.

In particular, this file is OK:

FILTER
test1
ORDER
test2
FILTER
test3
ORDER
FILTER
ORDER

And should give me the following groups:

"test1", "test2", "test3", "", "", ""

The regex that I already tried is: (?:FILTER\n(.*)\nORDER\n(.*))*

Here is the test on regex101.

I am pretty new to regex, any help would be appreciated.

sadfsa sdfasdf asked Oct 30 '22 03:10


1 Answer

You may use a regex based on lazy dot matching and a tempered greedy token:

(?s)FILTER(.*?)ORDER((?:(?!FILTER).)*)
           ^-^       ^--------------^

This regex needs the DOTALL modifier (enabled here by the inline (?s) flag) so that . can also match line breaks. Here is a regex demo. The .*? matches any characters, but as few as possible, thus matching up to the first ORDER. The (?:(?!FILTER).)* tempered greedy token matches any text that does not contain FILTER. It works like a negated character class, but for multicharacter sequences.

You can unroll it as follows:

FILTER([^O]*(?:O(?!RDER)[^O]*)*)ORDER([^F]*(?:F(?!ILTER)[^F]*)*)

See the regex demo (and this regex does not require a DOTALL mode).
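As a minimal sketch, the unrolled pattern can be dropped into Java without any flags. The class name UnrolledRegexDemo and the helper extract are hypothetical names chosen for illustration; the input string is the sample file from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class UnrolledRegexDemo {
    // Unrolled pattern: negated character classes match line breaks,
    // so no DOTALL flag is needed.
    private static final Pattern UNROLLED = Pattern.compile(
            "FILTER([^O]*(?:O(?!RDER)[^O]*)*)ORDER([^F]*(?:F(?!ILTER)[^F]*)*)");

    // Returns the trimmed FILTER and ORDER values in order of appearance.
    static List<String> extract(String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = UNROLLED.matcher(input);
        while (matcher.find()) {
            results.add(matcher.group(1).trim()); // text between FILTER and ORDER
            results.add(matcher.group(2).trim()); // text between ORDER and the next FILTER
        }
        return results;
    }

    public static void main(String[] args) {
        String s = "FILTER\ntest1\nORDER\ntest2\nFILTER\ntest3\nORDER\nFILTER\nORDER";
        System.out.println(extract(s));  // => [test1, test2, test3, , , ]
    }
}
```

The unrolled form usually needs fewer regex engine steps than the tempered greedy token because each [^O]* or [^F]* consumes a whole run of characters at once instead of checking a lookahead before every character.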

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "FILTER\ntest1\nORDER\ntest2\nFILTER\ntest3\nORDER\nFILTER\nORDER";
// (?s) turns on DOTALL so that . also matches line breaks
Pattern pattern = Pattern.compile("(?s)FILTER(.*?)ORDER((?:(?!FILTER).)*)");
Matcher matcher = pattern.matcher(s);
List<String> results = new ArrayList<>();
while (matcher.find()) {
    // Both groups always take part in a match, so they are never null;
    // trim() strips the line breaks around each value
    results.add(matcher.group(1).trim());
    results.add(matcher.group(2).trim());
}
System.out.println(results);  // => [test1, test2, test3, , , ]

See the IDEONE demo

If you need to make sure the FILTER and ORDER delimiter strings appear on lines of their own, put ^ and $ around them and add the MULTILINE modifier (so that ^ matches at the beginning of a line and $ matches at the end of a line):

(?sm)^FILTER$(.*?)^ORDER$((?:(?!^FILTER$).)*)
 ^^^^

See another regex demo.
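A small sketch of how the anchored variant might be used from Java follows. The class name AnchoredRegexDemo, the helper extract, and the second input string are hypothetical and only serve to show that delimiter words embedded in the data (PREORDER, FILTERED) are no longer treated as delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class AnchoredRegexDemo {
    // (?s) = DOTALL, (?m) = MULTILINE: FILTER and ORDER must fill a whole line.
    private static final Pattern ANCHORED = Pattern.compile(
            "(?sm)^FILTER$(.*?)^ORDER$((?:(?!^FILTER$).)*)");

    // Returns the trimmed FILTER and ORDER values in order of appearance.
    static List<String> extract(String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = ANCHORED.matcher(input);
        while (matcher.find()) {
            results.add(matcher.group(1).trim());
            results.add(matcher.group(2).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        // Sample file from the question
        System.out.println(extract(
                "FILTER\ntest1\nORDER\ntest2\nFILTER\ntest3\nORDER\nFILTER\nORDER"));
        // => [test1, test2, test3, , , ]

        // Made-up input with the delimiter words inside the data
        System.out.println(extract("FILTER\nPREORDER list\nORDER\nFILTERED by date"));
        // => [PREORDER list, FILTERED by date]
    }
}
```

Note that this sketch assumes \n line endings; with \r\n files, Java's $ in MULTILINE mode still matches before the line terminator, but that case is not exercised here.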

Wiktor Stribiżew answered Nov 15 '22 03:11