
Regex match from start label until empty line or end label

Tags:

regex

How can I match the content between a start label and either an empty line or an end label with a regex?

For example, given this input:

<START> some text is here. 
more text

unrelated text

<START> even more text. 
text text
<STOP>

It should produce two matches:

<START> some text is here. 
more text

and

<START> even more text. 
text text
<STOP>

The regex I have come up with so far is as follows (but it matches the whole text, I assume because of the (?s).* part).

<START>((?s).*)(\s\s|<STOP>)

tkja asked Sep 21 '15 22:09


2 Answers

You should make .* lazy so that it matches as few characters as possible, i.e. use .*?:

(?s)(<START>.*?)(?:(?:\r*\n){2}|<STOP>)

The capturing group leaves out the ending conditions you specified:

  1. (?:\r*\n){2} matches an empty line (two consecutive line breaks).
  2. <STOP> matches the end label.
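
To see the effect of the lazy quantifier, here is a minimal sketch in Python (my choice of language; the question does not name an engine). The sample string and the variable names are assumptions made for illustration, and the greedy variant is included only for comparison.

```python
import re

# Sample input modelled on the question (trailing spaces omitted).
sample = "\n".join([
    "<START> some text is here.",
    "more text",
    "",
    "unrelated text",
    "",
    "<START> even more text.",
    "text text",
    "<STOP>",
    "",
])

# Lazy version from this answer: .*? stops at the first ending condition.
LAZY = re.compile(r"(?s)(<START>.*?)(?:(?:\r*\n){2}|<STOP>)")
# Greedy version for comparison: .* runs on to the last ending condition.
GREEDY = re.compile(r"(?s)(<START>.*)(?:(?:\r*\n){2}|<STOP>)")

# With a single capturing group, findall returns the group contents.
lazy_matches = LAZY.findall(sample)
greedy_matches = GREEDY.findall(sample)

for m in lazy_matches:
    print(repr(m))
print(len(lazy_matches), "lazy matches,", len(greedy_matches), "greedy match")
```

Note that the group stops before the empty line or before <STOP>, so the end label itself is not part of the captured text.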


Mariano answered Oct 22 '22 21:10


You can design your pattern like this (with the modifier m):

<START>[^\n<]*(?:(?:<(?!STOP>)|\n(?!$))[^\n<]*)*(?:<STOP>|\n$|\z)


The idea is to match everything that is not a < or a newline with [^\n<]*. When a < or a newline is reached, a negative lookahead checks that the < is not followed by "STOP>", or that the newline is not followed by an end of line. If the lookahead succeeds, [^\n<]* (inside the non-capturing group this time) advances to the next < or newline. The group is repeated until <STOP>, an empty line, or the end of the string is reached.
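
As a rough check of this pattern, here is a sketch in Python, which is an assumption on my part since the pattern above is written in PCRE style. Python's re module only accepts \z from version 3.14 on, so the sketch swaps in \Z, which in Python matches only at the very end of the string. The sample string and names are made up for illustration.

```python
import re

# Sample input modelled on the question (trailing spaces omitted).
sample = "\n".join([
    "<START> some text is here.",
    "more text",
    "",
    "unrelated text",
    "",
    "<START> even more text.",
    "text text",
    "<STOP>",
    "",
])

# Adaptation of the pattern above: \z replaced by Python's \Z.
# re.MULTILINE is the "m" modifier, so $ matches at every end of line.
UNROLLED = re.compile(
    r"<START>[^\n<]*(?:(?:<(?!STOP>)|\n(?!$))[^\n<]*)*(?:<STOP>|\n$|\Z)",
    re.MULTILINE,
)

# No capturing groups, so findall returns the full matches.
matches = UNROLLED.findall(sample)
for m in matches:
    print(repr(m))
```

Here the whole match includes the terminator: the first match ends with the newline that precedes the empty line, and the second ends with <STOP>.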

Casimir et Hippolyte answered Oct 22 '22 23:10