
Regex for extracting names starting with MR|MS|THE|DR after HONOURABLE

Tags: python, regex

I am trying to write a regex that extracts the names starting with MR, MS, THE or DR that follow the word HONOURABLE.

For example, given this text:

      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 2    VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M.    0     1      0     0       1
      PANCHOLI
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 3    VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH   107    4     10     6      127
      J. SHASTRI

So, the output should be

[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE J.B.PARDIWALA]
[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE VIPUL M. PANCHOLI]
and so on

but I'm getting

THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH 
MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA

I have tried \s*HONOURABLE\s+(?=THE|MR|MS|DR)([^/\[\]\n]*)

HONOURABLE can be repeated any number of times.

Any help would be appreciated

Thanks in advance!


asked Feb 04 '21 13:02

Jayesh Agarwal


1 Answer

Bounty answer

You can use

import re
text = """     HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 2    VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M.    0     1      0     0       1
      PANCHOLI
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 3    VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH   107    4     10     6      127
      J. SHASTRI"""
text = re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M)
#print(text)
m = re.findall(r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)', text, re.M)
for x in m:
    print(x.replace('\n',' '))

Output:

THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE J.B.PARDIWALA
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. PANCHOLI
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH J. SHASTRI

See the Python demo.

Details:

- re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M) removes all spaces, tabs and digits from the start and end of each line in your text.
- r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)' is a regex that matches the following in the "trimmed" text:
  - ^ - start of a line (because of re.M)
  - HONOURABLE - the word HONOURABLE
  - \s+ - one or more whitespace characters
  - (.*(?:\n(?!HONOURABLE\b).*)*) - Capturing group 1:
    - .* - the rest of the line
    - (?:\n(?!HONOURABLE\b).*)* - zero or more lines that do not start with HONOURABLE as a whole word.
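
The strings returned this way still contain ",HONOURABLE" between the names. A minimal sketch of one way to get one list of names per record, as the question asks, assuming the names inside a record are always separated by a comma followed by HONOURABLE. The helper name extract_names is hypothetical and is not part of the answer above.

```python
import re

# Two sample records in the question's layout (row numbers and count columns included).
text = """     HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 2    VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M.    0     1      0     0       1
      PANCHOLI"""


def extract_names(raw):
    """Hypothetical helper: return one list of names per record."""
    # Step 1: trim digits, spaces and tabs from both ends of every line.
    trimmed = re.sub(r'^[\d \t]+|[\d \t]+$', '', raw, flags=re.M)
    # Step 2: a record starts at a line beginning with HONOURABLE and runs
    # until the next line that begins with HONOURABLE.
    records = re.findall(
        r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)', trimmed, re.M
    )
    result = []
    for record in records:
        flat = record.replace('\n', ' ')
        # Assumption: names inside a record are separated by ",HONOURABLE".
        result.append(re.split(r'\s*,\s*HONOURABLE\s+', flat))
    return result


for names in extract_names(text):
    print(names)
```

For the two sample records this prints ['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE J.B.PARDIWALA'] and ['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE VIPUL M. PANCHOLI'].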

Original answer

You can use

\bHONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)

See the regex demo. If you do not want line breaks in the resulting list items, you can replace them afterwards with .replace('\n', ' '). If you want the matches to also stop at [, / and ], add those characters to the negated character class: change [^,] to [^][/,].

Details:

\bHONOURABLE - a whole word HONOURABLE
\s+ - one or more whitespaces
((?:THE|MR|MS|DR)[^,]*) - Capturing group 1: THE, MR, MS or DR followed by zero or more characters other than a comma.

See a Python demo:

import re
rx = r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^,]*)"
text = "HONOURABLE THE CHIEF JUSTICE MR. JUSTICE\nVIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH\nJ. SHASTRI, HONOURABLE MS. ADITI GUPTA"
m = re.findall(rx, text)
print([x.replace('\n', ' ') for x in m])  # join wrapped lines with a space

Output:

['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE ASHUTOSH J. SHASTRI', 'MS. ADITI GUPTA']
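
The wider negated character class [^][/,] mentioned above can be sketched as follows. The input string is made up for illustration and does not come from the question; it only shows that a match now ends at the first [, ], / or comma.

```python
import re

# In [^][/,] the "]" comes right after "^", so it is a literal "]";
# the class then also excludes "[", "/" and ",".
rx = re.compile(r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^][/,]*)")

# Hypothetical input, invented for this sketch.
text = ("HONOURABLE MR. JUSTICE A.B. SHAH/HONOURABLE MS. JUSTICE C.D. RAO [retd], "
        "HONOURABLE DR. E.F. IYER")

# strip() drops the space left in front of a delimiter such as "[".
names = [m.strip() for m in rx.findall(text)]
print(names)
```

With this input the list is ['MR. JUSTICE A.B. SHAH', 'MS. JUSTICE C.D. RAO', 'DR. E.F. IYER'].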


answered Nov 15 '22 15:11

Wiktor Stribiżew
