I have input text that is loosely structured: each logical line can be spread over several physical sub-lines. The spreading happens in more than one way, and I can't figure out how to catch all the scenarios.
The basic structure of the input (a text file) is as follows:
11 abc 1 1
22 abc 2 2
def
33 abc
def 3 3
In English: a line is composed of 2 digits, then some text, and then 2 individual digits (e.g. "11 abc 1 1"). The text may, however, be spread over 2 (or more) lines. Sometimes the trailing digits appear on the first sub-line (e.g. "22 abc 2 2\ndef"), and sometimes on the last sub-line of a block of text that starts with the 2 digits (e.g. "33 abc\ndef 3 3").
My regexes manage to catch only one of the two scenarios.
I always used the following expression to get the matches:
re.findall(pat, t, re.M|re.DOTALL|re.X)
So I used re.M so that ^ and $ match at the start and end of every line, re.DOTALL so that . also matches newline characters, and re.X to make the patterns more readable with whitespace.
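For reference, here is a minimal sketch of what the first two flags change, using a made-up two-line sample string (the variable names are only illustrative):

```python
import re

sample = "11 abc\n22 def"

# Without re.M, ^ only matches at the very start of the string.
starts_plain = re.findall(r'^\d\d', sample)
# With re.M, ^ also matches right after each newline.
starts_multiline = re.findall(r'^\d\d', sample, re.M)

# Without re.DOTALL, . refuses to match a newline; with it, . crosses lines.
dot_plain = re.findall(r'abc.22', sample)
dot_dotall = re.findall(r'abc.22', sample, re.DOTALL)

print(starts_plain, starts_multiline)  # ['11'] ['11', '22']
print(dot_plain, dot_dotall)           # [] ['abc\n22']
```

Note that re.M by itself does not let a match span several lines; that part comes from re.DOTALL (or from matching \n or \s explicitly).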
I expect the following result:
[('11', 'abc', '1', '1', ''),
('22', 'abc', '2', '2', 'def'),
('33', 'abc\ndef', '3', '3', '')]
In other words, I want the numbers always to appear in the same locations of the tuples, and the text may be split in 2 parts (2nd and last position of the tuple), but none of the parts may be ignored.
I tried with the following:
pat = r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*?)?'
But this doesn't catch the 2nd part of the "22 ..." line.
Then I tried a more greedy approach:
pat = r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*)?'
But this catches the entire string.
Then I tried a lookahead, with the intent to start the next match as soon as a double-digit is encountered:
pat = r'^(\d\d) \s (.*?) \s? (\d)\s(\d) (.*?) (?=\d\d)'
But this doesn't catch the "33 ..." line, because it is the last line, and therefore no double-digits follow.
I tried a few other crooked things, not worth mentioning, but I can't find a solution to my problem.
Any hints would be greatly appreciated.
This works pretty well for me. I am splitting it into two scenarios: one where the trailing digits are on the same line, and another where they appear on the next line.
import re

pattern = r'''
^(\d{2})
\s+(.*?)
(?:
\s+(\d)\s+(\d)(.*?)$
|
\s*$
(?:
\n(?!.*?\d{2}).*?
)*?
\n(\d)\s+(\d)(.*?)$
)
'''
text = """
11 abc 1 1
22 abc 2 2
def
33 abc
def 3 3
"""
results = re.findall(pattern, text, re.MULTILINE | re.DOTALL | re.VERBOSE)
It produces some empty groups in each tuple, because the capturing groups of the alternative that did not match are still reported, but these are easy to get rid of.
In your first regex the last optional group (.*?)? is not followed by any condition. What would you expect it to match? A lazy quantifier takes as few characters as possible, which here means no characters at all, so the group will always be empty.
(.*)? in your second pattern, together with re.DOTALL, greedily matches the rest of the string. That makes it impossible to extract any further matches, because they have already been consumed by the first match.
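A quick way to see both effects is to run the two patterns from the question against the sample input. This is only a sketch; the names lazy_pat, greedy_pat and FLAGS are made up for the illustration:

```python
import re

text = "11 abc 1 1\n22 abc 2 2\ndef\n33 abc\ndef 3 3"
FLAGS = re.M | re.DOTALL | re.X

# First attempt from the question: lazy optional last group.
lazy_pat = r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*?)?'
# Second attempt from the question: greedy optional last group.
greedy_pat = r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*)?'

lazy_result = re.findall(lazy_pat, text, FLAGS)
greedy_result = re.findall(greedy_pat, text, FLAGS)

# Every record is found, but the last group is always empty.
print([m[4] for m in lazy_result])
# Only one match: the last group swallowed the rest of the string.
print(len(greedy_result), repr(greedy_result[0][4]))
```

With the lazy version all three records are found but "def" is lost; with the greedy version the "22" and "33" records end up inside the last group of the first match.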
The last attempt with the lookahead looks most promising, but you need to put it inside the group as a stop condition. To match until a newline followed by two digits OR end: (.*?(?=\n\d\d|\Z))?
See this demo at regex101 (if you have CRLF line breaks, add an optional \r? before the \n).
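Putting this together, a possible full version could look like the sketch below. It is a variation on the expression above, not the exact pattern from the demo: the newline in front of the continuation text is matched outside the capturing group (so that the last group holds 'def' rather than '\ndef'), and a negative lookahead keeps a following record from being mistaken for a continuation line. The name parse_records is made up for this example, and the code assumes plain \n line breaks and that a continuation line never starts with two digits followed by whitespace.

```python
import re

RECORD = re.compile(r'''
    ^(\d\d) \s                  # two leading digits at the start of a line
    (.*?) \s                    # text before the trailing digits (may span lines)
    (\d) \s (\d)                # the two single trailing digits
    (?: \n (?!\d\d\s) (.*?) )?  # optional continuation text on following lines
    (?= \n\d\d\s | \s*\Z )      # stop before the next record or at the end
''', re.M | re.DOTALL | re.X)


def parse_records(text):
    """Return one 5-tuple per record; groups that did not match come back as ''."""
    return RECORD.findall(text)


text = "11 abc 1 1\n22 abc 2 2\ndef\n33 abc\ndef 3 3\n"
for record in parse_records(text):
    print(record)
# ('11', 'abc', '1', '1', '')
# ('22', 'abc', '2', '2', 'def')
# ('33', 'abc\ndef', '3', '3', '')
```

Because the stop condition is a lookahead, the newline before the next record is not consumed, so ^ can still match at the start of that record in the next findall step.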