I am trying to write a regex that identifies names starting with MR, MS, THE or DR after the word HONOURABLE.
For example:
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
1 VIKRAM NATH,HONOURABLE MR. JUSTICE 1 1 0 3 5
J.B.PARDIWALA
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
2 VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. 0 1 0 0 1
PANCHOLI
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
3 VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH 107 4 10 6 127
J. SHASTRI
So, the output should be
[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE J.B.PARDIWALA]
[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE VIPUL M. PANCHOLI]
and so on
but I'm getting
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH
MR. JUSTICE 1 1 0 3 5
J.B.PARDIWALA
I have tried \s*HONOURABLE\s+(?=THE|MR|MS|DR)([^/\[\]\n]*)
HONOURABLE can be repeated any number of times.
Any help would be appreciated.
Thanks in advance!
Bounty answer
You can use
import re
text = """ HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
1 VIKRAM NATH,HONOURABLE MR. JUSTICE 1 1 0 3 5
J.B.PARDIWALA
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
2 VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. 0 1 0 0 1
PANCHOLI
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
3 VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH 107 4 10 6 127
J. SHASTRI"""
text = re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M)
#print(text)
m = re.findall(r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)', text, re.M)
for x in m:
    print(x.replace('\n',' '))
Output:
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE J.B.PARDIWALA
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. PANCHOLI
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH J. SHASTRI
See the Python demo.
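If you need each record as a list of separate names, as in the expected output in the question, one possible follow-up is to split every match on the embedded ",HONOURABLE". The following is only a minimal sketch under that assumption: the helper name extract_benches is hypothetical, the sample is the first two records from the question, and the trimming step would also strip digits that legitimately end a name.

```python
import re

# First two records from the question, used as sample input.
raw = """HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
1 VIKRAM NATH,HONOURABLE MR. JUSTICE 1 1 0 3 5
J.B.PARDIWALA
HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
2 VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. 0 1 0 0 1
PANCHOLI"""


def extract_benches(text):
    """Return one list of names per record (hypothetical helper)."""
    # Strip digits, spaces and tabs from both ends of every line.
    trimmed = re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M)
    # A record starts at a line beginning with HONOURABLE and runs
    # until the next line that begins with HONOURABLE.
    records = re.findall(
        r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)', trimmed, re.M
    )
    benches = []
    for record in records:
        flat = record.replace('\n', ' ')
        # Split on the ",HONOURABLE" that separates names inside a record.
        names = re.split(r',\s*HONOURABLE\s+', flat)
        benches.append([name.strip() for name in names])
    return benches


for bench in extract_benches(raw):
    print(bench)
```

With the sample above, each printed item is a two-element list such as ['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE J.B.PARDIWALA'].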
Details:
re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M) removes all spaces, tabs and digits from the start and end of each line in your text.
r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)' is a regex that matches the following in the "trimmed" text:
^ - start of a line
HONOURABLE - the word HONOURABLE
\s+ - one or more whitespace characters
(.*(?:\n(?!HONOURABLE\b).*)*) - Capturing group 1:
    .* - the rest of the line
    (?:\n(?!HONOURABLE\b).*)* - zero or more lines that do not start with HONOURABLE as a whole word.
Original answer
You can use
\bHONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)
See the regex demo. If you do not want linebreaks in the resulting list items, you can replace them afterwards with .replace('\n', ' '). If you want the matches to also stop at [, / and ], add these characters to the negated character class: change [^,] to [^][/,].
Details:
\bHONOURABLE - the whole word HONOURABLE
\s+ - one or more whitespace characters
((?:THE|MR|MS|DR)[^,]*) - Capturing group 1: THE, MR, MS or DR, followed by zero or more characters other than a comma.
See a Python demo:
import re
rx = r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^,]*)"
text = "HONOURABLE THE CHIEF JUSTICE MR. JUSTICE\nVIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH\nJ. SHASTRI, HONOURABLE MS. ADITI GUPTA"
m = re.findall(rx, text)
print([x.replace('\n', ' ') for x in m])
Output:
['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE ASHUTOSH J. SHASTRI', 'MS. ADITI GUPTA']
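The tighter right-hand boundary mentioned in this answer (stopping at [, / and ] as well as at a comma) could look roughly like the sketch below. The sample string and the names in it are made up purely for illustration, and the brackets are escaped inside the character class here only for readability; [^][/,] is an equivalent spelling.

```python
import re

# Same pattern as above, but the negated class also excludes
# '[', ']' and '/', so a match stops at any of them or at a comma.
rx = re.compile(r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^\]\[/,]*)")

# Hypothetical sample line, invented for this illustration.
sample = (
    "HONOURABLE MR. JUSTICE A. KUMAR [retd], "
    "HONOURABLE MS. B. RAO/HONOURABLE DR. C. SEN"
)

# strip() drops the space left before a delimiter such as '['.
names = [m.strip() for m in rx.findall(sample)]
print(names)
```

For this sample, the three matches end at the opening bracket, the slash and the end of the string respectively.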