I have a very large string. There are many paragraphs inside that string. Each paragraph starts with a title and follows a particular pattern.
Example:
== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............
The title pattern is:
1.) A new paragraph title starts with an equals sign ( = ), which can be followed by any number of = characters.
2.) After the = characters there can be whitespace (optional), followed by the title text.
3.) After the text there can again be whitespace (optional), followed by any number of equals signs ( = ).
4.) Then the paragraph starts. I have to extract the text until a similar pattern is encountered.
Can anyone help me with how to do this with regex? TIA
You may use
re.findall(r'(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)', s)
See the regex demo
Details
(?m)^ - start of a line
=+ - 1 or more = chars
[^\S\r\n]* - zero or more whitespace chars other than CR and LF
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
[^\S\r\n]* - zero or more whitespace chars other than CR and LF
=+ - 1 or more = chars
\s* - 0+ whitespaces
(.*(?:\r?\n(?!=+.*?=).*)*) - Group 2:
    .* - any zero or more chars other than line break chars, as many as possible
    (?:\r?\n(?!=+.*?=).*)* - zero or more sequences of
        \r?\n(?!=+.*?=) - an optional CR and then LF that is not followed with 1+ = chars, then any chars other than line break chars, as few as possible, and then another = char
        .* - any zero or more chars other than line break chars, as many as possible
Python demo:
import re
rx = r"(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)"
s = "== Title1 ==\n..........................\n.............\nEnd of Paragraph\n===Title2 ===\n.............\n.............\n............."
print(re.findall(rx, s))
Output:
[('Title1', '..........................\n.............\nEnd of Paragraph'), ('Title2', '.............\n.............\n.............')]
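If the tuples from re.findall get hard to read, the same pattern can be written with named groups and iterated with re.finditer. This is only a sketch: the group names title and body and the helper split_sections are names made up for illustration, and the sample string is a shortened version of the demo input.

```python
import re

# Same pattern as in the demo, with named groups (the names are illustrative)
# and re.MULTILINE passed as a flag instead of the inline (?m).
rx = re.compile(
    r"^=+[^\S\r\n]*(?P<title>.*?)[^\S\r\n]*=+\s*"
    r"(?P<body>.*(?:\r?\n(?!=+.*?=).*)*)",
    re.MULTILINE,
)

def split_sections(text):
    """Return a list of (title, body) pairs, one per heading found in text."""
    return [(m.group("title"), m.group("body")) for m in rx.finditer(text)]

s = "== Title1 ==\n.............\nEnd of Paragraph\n===Title2 ===\n.............\n............."
for title, body in split_sections(s):
    print(title, "->", body.splitlines())
# Title1 -> ['.............', 'End of Paragraph']
# Title2 -> ['.............', '.............']
```

Named groups make it harder to mix up which capture is the title and which is the paragraph text, and finditer avoids building the whole result list when the string is very large.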
Maybe this helps for finding each paragraph's title and the lines of each paragraph.
text = """== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............
"""
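import re
# Matches a heading such as "== Title1 ==". Note: \w+ only matches titles made of a single run of word characters.
reg = re.compile(r'(?:[=]+\s*\w+\s*[=]+)')
for i in text.split('\n'):
    if reg.search(i):
        # Heading line: drop the = markers and the surrounding whitespace
        t = re.sub(r'=', '', i)
        print('Title:', t.strip())
    else:
        print('line:', i.strip())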
# Output like this
Title: Title1 // Paragraph starts
line: .............
line: ............. // Some texts
line: .............
line: End of Paragraph
Title: Title2 // Paragraph starts
line: .............
line: ............. // Some texts
line: .............
line:
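Building on the line-by-line idea, here is a rough sketch that collects the lines under each title into a dict instead of printing them. The function name group_by_title, the heading pattern, and the sample string are assumptions for illustration only. It also assumes titles are unique, since a repeated title would overwrite the earlier entry.

```python
import re

# A heading line: one or more '=', optional spaces/tabs, the title text,
# optional spaces/tabs, then one or more '='. Anything after the closing
# run of '=' (for example a trailing comment) is ignored.
heading = re.compile(r"^=+[ \t]*(.*?)[ \t]*=+")

def group_by_title(text):
    """Map each title to the list of non-empty lines that follow it."""
    sections = {}
    current = None  # title being filled; None until the first heading is seen
    for line in text.splitlines():
        m = heading.match(line)
        if m:
            current = m.group(1)
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections

sample = "== Title1 ==\nfirst line\nEnd of Paragraph\n===Title2 ===\nsecond line\n"
print(group_by_title(sample))
# {'Title1': ['first line', 'End of Paragraph'], 'Title2': ['second line']}
```

Unlike the \w+ pattern, the lazy (.*?) here also accepts titles with several words, such as "== My Title ==".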