I have a very large string. There are many paragraphs inside that string. Each paragraph starts with a title and follows a particular pattern.
Example:
== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............
The title follows this pattern:
1.) A new paragraph title starts with an equals sign ( = ), which can be followed by any number of further = characters.
2.) After the = characters, there can be whitespace (optional), followed by the title text.
3.) After the title text, there can again be whitespace (optional), followed by any number of equals signs ( = ).
4.) Then the paragraph starts. I have to extract the text up to the next line that matches the same pattern.
Can anyone help me do this with a regex? TIA
You may use
re.findall(r'(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)', s)
Details
(?m)^ - start of a line
=+ - 1 or more = chars
[^\S\r\n]* - zero or more whitespace chars other than CR and LF
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
[^\S\r\n]* - zero or more whitespace chars other than CR and LF
=+ - 1 or more = chars
\s* - 0+ whitespaces
(.*(?:\r?\n(?!=+.*?=).*)*) - Group 2:
    .* - any zero or more chars other than line break chars, as many as possible
    (?:\r?\n(?!=+.*?=).*)* - zero or more sequences of
        \r?\n(?!=+.*?=) - an optional CR and then LF that is not followed by 1+ = chars, then any chars other than line break chars, as few as possible, and then another = char
        .* - any zero or more chars other than line break chars, as many as possible
Python demo:
import re
rx = r"(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)"
s = "== Title1 ==\n..........................\n.............\nEnd of Paragraph\n===Title2 ===\n.............\n.............\n............."
print(re.findall(rx, s))
Output:
[('Title1', '..........................\n.............\nEnd of Paragraph'), ('Title2', '.............\n.............\n.............')]
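If the pattern is hard to read on one line, it can be written with re.VERBOSE and named groups. The following is only a sketch of the same regex as above: the names SECTION_RX, split_sections, title and body are made up for illustration, and it assumes titles are unique, since later duplicates would overwrite earlier ones in the dict.

```python
import re

# Same pattern as in the answer above, spread out with re.VERBOSE.
# SECTION_RX and split_sections are illustrative names, not part of any library.
SECTION_RX = re.compile(r"""
    ^=+                 # a title line starts with one or more = chars
    [^\S\r\n]*          # optional whitespace, but no line breaks
    (?P<title>.*?)      # title text, as short as possible
    [^\S\r\n]*          # optional whitespace, but no line breaks
    =+                  # closing run of = chars
    \s*                 # whitespace, including the line break after the title
    (?P<body>           # paragraph body:
        .*              #   rest of the current line
        (?:\r?\n        #   then every following line
           (?!=+.*?=)   #   that does not look like a title line
           .*
        )*
    )
""", re.MULTILINE | re.VERBOSE)


def split_sections(text):
    """Return a dict mapping each title to its paragraph text."""
    return {m.group("title"): m.group("body") for m in SECTION_RX.finditer(text)}


if __name__ == "__main__":
    s = "== Title1 ==\n.............\nEnd of Paragraph\n===Title2 ===\n.............\n............."
    for title, body in split_sections(s).items():
        print(repr(title), repr(body))
```

Using finditer instead of findall gives access to the named groups, so the title and the body do not have to be picked out of a tuple by position.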
Maybe this helps to find each paragraph's title and the lines of each paragraph.
text = """== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............
"""
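import re
# A title line: a run of '=', a single word, then another run of '='
reg = re.compile(r'(?:[=]+\s*\w+\s*[=]+)')
for i in text.split('\n'):
    if reg.search(i):
        # Drop the '=' markers and print what is left of the title line
        t = re.sub(r'=', '', i)
        print('Title:', t.strip())
    else:
        print('line:', i.strip())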
Output:
   Title: Title1  // Paragraph starts
   line: .............
   line: ............. // Some texts
   line: .............
   line: End of Paragraph
   Title: Title2  // Paragraph starts
   line: .............
   line: ............. // Some texts
   line: .............
   line: 
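The same line-by-line idea can also collect the lines under their title instead of printing them. This is a rough sketch, not a drop-in replacement: TITLE_RX and group_by_title are hypothetical names, and it assumes that a title sits alone on its line (so the // annotations from the sample are not part of the real data) and that titles are unique.

```python
import re

# Hypothetical helper names; the pattern assumes the title is the whole line.
TITLE_RX = re.compile(r"^=+\s*(.*?)\s*=+\s*$")


def group_by_title(text):
    """Map each title to the list of lines that follow it."""
    sections = {}
    current = None
    for line in text.splitlines():
        m = TITLE_RX.match(line)
        if m:
            # A new title starts a new, empty paragraph.
            current = m.group(1)
            sections[current] = []
        elif current is not None:
            # Lines before the first title are ignored.
            sections[current].append(line)
    return sections


if __name__ == "__main__":
    sample = "== Title1 ==\nfirst line\nsecond line\n===Title2 ===\nthird line\n"
    print(group_by_title(sample))
```

Because splitlines() is used instead of split('\n'), a trailing line break does not produce an extra empty line at the end.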