Extracting text from string using Regex

Question

I have a very large string. There are many paragraphs inside that string. Each paragraph starts with a title and follows a particular pattern.

Example:

== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............

The titles follow this pattern:

1.) A new paragraph title starts with an equals sign ( = ), which can be followed by any number of = characters.

2.) After the = characters there can be whitespace ( optional ), followed by the title text.

3.) After the title text there can again be whitespace ( optional ), followed by any number of equals signs ( = ).

4.) Then the paragraph starts. I have to extract the text until a similar title pattern is encountered.

Can anyone help me with how to do this with regex? TIA

Wiktor Stribiżew · Accepted Answer

You may use

re.findall(r'(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)', s)

See the regex demo

Details

(?m)^ - start of a line
=+ - 1 or more = chars
[^\S\r\n]* - zero or more whitespace chars other than CR and LF
(.*?) - Group 1: any zero or more chars, other than line break chars, as few as possible
[^\S\r\n]* - zero or more whitespace chars other than CR and LF
=+ - 1 or more = chars
\s* - 0+ whitespaces
(.*(?:\r?\n(?!=+.*?=).*)*) - Group 2:
- .* - any zero or more chars, other than line break chars, as many as possible
- (?:\r?\n(?!=+.*?=).*)* - zero or more sequences of
  - \r?\n(?!=+.*?=) - an optional CR and then an LF, not followed by 1+ = chars, then any chars other than line break chars (as few as possible), and then another = char
  - .* - any zero or more chars, other than line break chars, as many as possible
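
The breakdown above can also be kept next to the pattern itself. The following is a minimal sketch, assuming the same regex rewritten with re.VERBOSE and re.MULTILINE instead of the inline (?m); the names TITLE_AND_BODY and sample are illustrative and not part of the original answer.

```python
import re

# Same pattern as in the answer, split across lines with one comment per part.
TITLE_AND_BODY = re.compile(
    r"""
    ^=+                 # one or more '=' at the start of a line
    [^\S\r\n]*          # optional whitespace other than CR and LF
    (.*?)               # group 1: title text, as short as possible
    [^\S\r\n]*          # optional whitespace other than CR and LF
    =+                  # closing run of '='
    \s*                 # any whitespace, including the line break
    (                   # group 2: paragraph body
        .*              #   rest of the current line
        (?:
            \r?\n       #   a line break
            (?!=+.*?=)  #   that is not followed by another title line
            .*          #   then the whole next line
        )*
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

# Illustrative input, shorter than the one in the question.
sample = "== A ==\nfirst\nsecond\n===B===\nthird"
pairs = TITLE_AND_BODY.findall(sample)
print(pairs)
```

With this layout, whitespace and comments inside the pattern are ignored by the regex engine, so the behavior should match the one-line version.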

Python demo:

import re

rx = r"(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)"
s = "== Title1 ==\n..........................\n.............\nEnd of Paragraph\n===Title2 ===\n.............\n.............\n............."
print(re.findall(rx, s))

Output:

[('Title1', '..........................\n.............\nEnd of Paragraph'), ('Title2', '.............\n.............\n.............')]
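
If a lookup by title is more convenient than a list of tuples, the matches can be collected into a dictionary. This is only a sketch built on the pattern above: the helper name sections and the sample text are assumptions made for illustration, and a repeated title would overwrite the earlier entry.

```python
import re

rx = re.compile(r"(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)")

def sections(text):
    """Map each title (group 1) to its paragraph body (group 2)."""
    return {m.group(1): m.group(2) for m in rx.finditer(text)}

# Illustrative input, not taken from the question.
s = "== Title1 ==\nline a\nline b\n===Title2 ===\nline c"
result = sections(s)
print(result)
```

Note that any body line that starts with = and contains a later = is treated as a new title by the lookahead, so such lines end the current paragraph.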

utks009 · Answer

Maybe this helps to find each paragraph's title and the lines of each paragraph.

text = """== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............
"""
import re

reg = re.compile(r'(?:[=]+\s*\w+\s*[=]+)')

for i in text.split('\n'):
    if re.search(reg, i):
        t = re.sub(r'=', '', i)
        print('Title:', t.strip())
    else:
        print('line:', i.strip())

 # Output like this
   Title: Title1  // Paragraph starts
   line: .............
   line: ............. // Some texts
   line: .............
   line: End of Paragraph
   Title: Title2  // Paragraph starts
   line: .............
   line: ............. // Some texts
   line: .............
   line:
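
The same line-by-line idea can be extended to group the lines under their titles instead of printing them. The sketch below is an assumption about what the asker needs, not part of the answer: title_re, group_lines and the sample text are hypothetical names, and lines that appear before the first title are dropped.

```python
import re

# Assumed title shape: leading '=' run, title text, then another '=' run.
title_re = re.compile(r'^=+\s*(.*?)\s*=+')

def group_lines(text):
    """Collect the stripped lines of each paragraph under its title."""
    paragraphs = {}
    current = None
    for line in text.splitlines():
        m = title_re.match(line)
        if m:
            # A title line opens a new paragraph.
            current = m.group(1)
            paragraphs[current] = []
        elif current is not None:
            paragraphs[current].append(line.strip())
    return paragraphs

# Illustrative input, not taken from the question.
text = "== Title1 == // Paragraph starts\nfoo\nbar\n===Title2 ===\nbaz\n"
grouped = group_lines(text)
print(grouped)
```

Using splitlines() rather than split('\n') avoids the empty trailing entry that shows up as the last "line:" in the output above.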

Tags: python, regex

Gopal Chitalia
