Python regex catch optional trailing lines

Question

I have input text that is loosely structured: each logical line can be spread over several physical sub-lines. The spreading can happen in different ways, however, and I can't figure out how to catch all of these scenarios.

The basic structure of the input (a text file) is as follows:

11 abc 1 1
22 abc 2 2
def
33 abc
def 3 3

In English: a line is composed of 2 digits, then some text, and then 2 individual digits (e.g. "11 abc 1 1"). The text may, however, be spread over 2 (or more) lines. Sometimes the trailing digits appear on the first sub-line (e.g. "22 abc 2 2 def"), sometimes on the last sub-line of a block of text that starts with the 2 digits (e.g. "33 abc def 3 3").

My regexes manage to catch only one of the two scenarios.

I always used the following expression to get the matches:

re.findall(pat, t, re.M|re.DOTALL|re.X)

So I used re.M so that ^ matches at the start of every line, re.DOTALL so that . also matches newline characters, and re.X to make the patterns more readable with whitespace.
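To make the effect of each flag concrete, here is a minimal sketch on a made-up two-line sample (the sample string and the variable names are illustrative, not taken from the input file above):

```python
import re

sample = "11 abc\n22 def"  # made-up two-line sample

# Without re.M, ^ anchors only at the very start of the string.
no_m = re.findall(r'^\d\d', sample)
# With re.M, ^ also anchors right after each newline.
with_m = re.findall(r'^\d\d', sample, re.M)

# Without re.DOTALL, '.' stops at a newline; with it, '.' crosses newlines.
no_dotall = re.match(r'.*', sample).group()
with_dotall = re.match(r'.*', sample, re.DOTALL).group()

# With re.X, unescaped whitespace in the pattern is ignored,
# so a real space has to be written as \s or [ ].
with_x = re.findall(r'(\d\d) \s (\w+)', sample, re.X)

print(no_m, with_m, repr(no_dotall), repr(with_dotall), with_x, sep='\n')
```

Note that re.M only changes where ^ and $ may match; it is re.DOTALL that lets a single .* run across several lines.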

I expect the following result:

[('11', 'abc', '1', '1', ''),
 ('22', 'abc', '2', '2', 'def'),
 ('33', 'abc\ndef', '3', '3', '')]

In other words, I want the numbers always to appear in the same locations of the tuples, and the text may be split in 2 parts (2nd and last position of the tuple), but none of the parts may be ignored.

I tried with the following:

pat = r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*?)?'

But this doesn't catch the 2nd part of the "22 ..." line.

Then I tried a more greedy approach:

pat = r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*)?'

But this catches the entire string.

Then I tried a lookahead, with the intent to start the next match as soon as a double-digit is encountered:

pat = r'^(\d\d) \s (.*?) \s? (\d)\s(\d) (.*?) (?=\d\d)'

But this doesn't catch the "33 ..." line, because it is the last line, and therefore no double-digits follow.
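A short sketch that reproduces this on the sample input (the variable names are only for illustration): the trailing (?=\d\d) needs two digits somewhere after the match, which the last block cannot provide.

```python
import re

t = "11 abc 1 1\n22 abc 2 2\ndef\n33 abc\ndef 3 3"

# Third attempt from above: the lookahead sits outside the last group.
pat = r'^(\d\d) \s (.*?) \s? (\d)\s(\d) (.*?) (?=\d\d)'

matches = re.findall(pat, t, re.M | re.DOTALL | re.X)
leading = [m[0] for m in matches]
print(leading)  # the "33" block is missing
```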

I tried a few other crooked things, not worth mentioning, but I can't find a solution to my problem.

Any hints would be greatly appreciated.

Jon Zavialov · Accepted Answer

This works pretty well for me. I am splitting it into two scenarios: one where the trailing digits are on the same line, and another where they appear on the next line.

import re

pattern = r'''
    ^(\d{2})
    \s+(.*?)
    (?:
        # trailing digits on the same line
        \s+(\d)\s+(\d)(.*?)$
    |
        # trailing digits on a later line
        \s*$
        (?:
            \n(?!.*?\d{2}).*?
        )*?
        \n(\d)\s+(\d)(.*?)$
    )
    '''

text = """
11 abc 1 1
22 abc 2 2
def
33 abc
def 3 3
"""

results = re.findall(pattern, text, re.MULTILINE | re.DOTALL | re.VERBOSE)

It produces some empty groups, because the capturing groups of the alternative that did not match stay empty, but these are easy to get rid of.
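As an illustration of that clean-up step, here is a minimal sketch with a deliberately simplified two-branch pattern and a made-up sample (neither is the pattern above): re.findall fills the groups of the unused branch with empty strings, and a comprehension drops them.

```python
import re

sample = "11 1 1\n22\n2 2"  # made-up: digits on the same line, then on the next line

# Simplified, hypothetical pattern: one capture pair per branch.
simple = r'^(\d\d) (?: [ ](\d)[ ](\d)$ | \n(\d)[ ](\d)$ )'

rows = re.findall(simple, sample, re.M | re.X)

# Drop the empty strings left by the branch that did not take part.
# Caveat: this also drops groups that are legitimately empty.
compact = [tuple(g for g in row if g) for row in rows]
print(rows)
print(compact)
```

If an empty group carries meaning (such as an empty trailing text), it is safer to pick the groups by position instead of filtering on emptiness.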

bobble bubble · Answer

In your first regex the last optional group (.*?)? does not contain any condition. What would you expect it to match? As few characters as possible means no characters at all, so it will certainly be empty.

(.*)? in your second pattern together with re.DOTALL will greedily match the rest of the string - making it impossible to extract potential further matches which were consumed by the 1st match.
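Both effects can be checked with the question's first two patterns on the sample input; this is just a small sketch, and the variable names are mine:

```python
import re

t = "11 abc 1 1\n22 abc 2 2\ndef\n33 abc\ndef 3 3"
flags = re.M | re.DOTALL | re.X

lazy = r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*?)?'
greedy = r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*)?'

lazy_result = re.findall(lazy, t, flags)
greedy_result = re.findall(greedy, t, flags)

# Lazy version: three matches, but the last group is always empty.
print([m[-1] for m in lazy_result])
# Greedy version: a single match whose last group holds the rest of the input.
print(len(greedy_result), repr(greedy_result[0][-1]))
```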

The last attempt with the lookahead looks most promising, but you need to put it inside the group as a stop condition. To match until a newline followed by two digits OR end: (.*?(?=\n\d\d|\Z))?

See this demo at regex101 (if you have CRLF line breaks, add an optional \r? before the \n).
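Put together, it could look like the sketch below. It assumes the stop-condition group is simply appended to the first pattern from the question, so the pattern in the linked demo may differ in detail; the strip() step is my addition, to remove the line break that the last group picks up before "def".

```python
import re

t = "11 abc 1 1\n22 abc 2 2\ndef\n33 abc\ndef 3 3"

# Assumed combination: the question's first pattern plus a last group
# that stops before "newline + two digits" or at the end of the string.
pat = r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*?(?=\n\d\d|\Z))?'

raw = re.findall(pat, t, re.M | re.DOTALL | re.X)

# The last group starts with the line break before the continuation text,
# so strip surrounding whitespace from it.
result = [m[:4] + (m[4].strip(),) for m in raw]

expected = [('11', 'abc', '1', '1', ''),
            ('22', 'abc', '2', '2', 'def'),
            ('33', 'abc\ndef', '3', '3', '')]
print(result == expected)
```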

Tags: python, regex

Antoine De Groote
