
Regex for extracting names starting with MR|MS|THE|DR after HONOURABLE

Tags: python, regex

I am trying to write a regex that identifies names starting with MR, MS, THE or DR that follow the word HONOURABLE.

For example, given this text:

      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 2    VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M.    0     1      0     0       1
      PANCHOLI
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 3    VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH   107    4     10     6      127
      J. SHASTRI

So, the output should be

[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE J.B.PARDIWALA]
[THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH, MR. JUSTICE VIPUL M. PANCHOLI]
and so on

but I'm getting

THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH 
MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA

I have tried \s*HONOURABLE\s+(?=THE|MR|MS|DR)([^/\[\]\n]*)

HONOURABLE can be repeated any number of times.

Any help would be appreciated

Thanks in advance!

Jayesh Agarwal asked Feb 04 '21 13:02



1 Answer

Bounty answer

You can use

import re
text = """     HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 2    VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M.    0     1      0     0       1
      PANCHOLI
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 3    VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH   107    4     10     6      127
      J. SHASTRI"""
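# Strip leading/trailing digits, spaces and tabs from every line (drops the table columns)
text = re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M)
# Capture everything after a line-initial HONOURABLE up to the next line that starts with HONOURABLE
m = re.findall(r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)', text, re.M)
for x in m:
    # Join the wrapped lines of each record with single spaces
    print(x.replace('\n', ' '))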

Output:

THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE J.B.PARDIWALA
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M. PANCHOLI
THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH J. SHASTRI

See the Python demo.

Details:

  • re.sub(r'^[\d \t]+|[\d \t]+$', '', text, flags=re.M) removes all spaces, tabs and digits from the start and end of each line in your text.

  • r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)' is a regex that matches the following in the "trimmed" text:

    • ^ - start of a line (because of re.M)
    • HONOURABLE - the word HONOURABLE
    • \s+ - one or more whitespace characters
    • (.*(?:\n(?!HONOURABLE\b).*)*) - Capturing group 1:
      • .* - the rest of the line
      • (?:\n(?!HONOURABLE\b).*)* - zero or more lines that do not start with HONOURABLE as a whole word.
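
The two steps described above yield one string per table row, with the judges still joined by ",HONOURABLE". A minimal sketch of going one step further, to the per-row lists shown in the question, might look like this. The helper name extract_benches and the split pattern are illustrative assumptions, not part of the original answer.

```python
import re

# Two sample rows copied from the question, table columns included.
text = """     HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 1    VIKRAM NATH,HONOURABLE MR. JUSTICE             1     1      0     3       5
      J.B.PARDIWALA
      HONOURABLE THE CHIEF JUSTICE MR. JUSTICE
 2    VIKRAM NATH,HONOURABLE MR. JUSTICE VIPUL M.    0     1      0     0       1
      PANCHOLI"""


def extract_benches(raw):
    """Hypothetical helper: return one list of judge names per table row."""
    # Step 1: trim digits, spaces and tabs from both ends of every line.
    trimmed = re.sub(r'^[\d \t]+|[\d \t]+$', '', raw, flags=re.M)
    # Step 2: one record per line-initial HONOURABLE, spanning wrapped lines.
    records = re.findall(
        r'^HONOURABLE\s+(.*(?:\n(?!HONOURABLE\b).*)*)', trimmed, flags=re.M
    )
    benches = []
    for record in records:
        flat = record.replace('\n', ' ')
        # Step 3 (assumption): judges inside a record are separated by ",HONOURABLE".
        benches.append(re.split(r'\s*,\s*HONOURABLE\s+', flat))
    return benches


benches = extract_benches(text)
for names in benches:
    print(names)
# ['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE J.B.PARDIWALA']
# ['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE VIPUL M. PANCHOLI']
```

Splitting after the records are flattened keeps the regexes short. It does rely on every judge after the first being introduced by a comma followed by HONOURABLE, as in the sample data.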

Original answer

You can use

\bHONOURABLE\s+((?:THE|MR|MS|DR)[^,]*)

See the regex demo. If you do not want line breaks in the resulting list items, you can replace them afterwards with .replace('\n', ' '). If you want your matches to also stop at [, / and ], add these characters to the negated character class, i.e. change [^,] to [^][/,].

Details:

  • \bHONOURABLE - the word HONOURABLE, preceded by a word boundary
  • \s+ - one or more whitespace characters
  • ((?:THE|MR|MS|DR)[^,]*) - Capturing group 1: THE, MR, MS or DR followed by zero or more chars other than a comma.

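See a Python demo (the pattern here also adds \b after the alternation, so only the whole words THE, MR, MS and DR match):

import re
rx = r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^,]*)"
text = "HONOURABLE THE CHIEF JUSTICE MR. JUSTICE\nVIKRAM NATH,HONOURABLE MR. JUSTICE ASHUTOSH\nJ. SHASTRI, HONOURABLE MS. ADITI GUPTA"
m = re.findall(rx, text)
# Replace line breaks with a space so that words from wrapped lines do not run together
print([x.replace('\n', ' ') for x in m])

Output:

['THE CHIEF JUSTICE MR. JUSTICE VIKRAM NATH', 'MR. JUSTICE ASHUTOSH J. SHASTRI', 'MS. ADITI GUPTA']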
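
The wider negated character class mentioned earlier, [^][/,], can be tried out with a small sketch like the one below. The sample string is a hypothetical input: the names come from the demo above, while the bracketed note and the text after the slash are invented for illustration.

```python
import re

# Same pattern as the demo, but the match now also stops at ], [ and /.
rx = re.compile(r"\bHONOURABLE\s+((?:THE|MR|MS|DR)\b[^][/,]*)")

# Hypothetical input: "[item 3]" and "/ BENCH 2" are invented trailing text.
sample = ("HONOURABLE MR. JUSTICE ASHUTOSH\nJ. SHASTRI [item 3], "
          "HONOURABLE MS. ADITI GUPTA / BENCH 2")

# Join wrapped lines with a space and drop the space left before [ or /.
names = [m.replace("\n", " ").strip() for m in rx.findall(sample)]
print(names)
# ['MR. JUSTICE ASHUTOSH J. SHASTRI', 'MS. ADITI GUPTA']
```

Placing ] first inside the negated class lets it be read as a literal character, so no escaping is needed.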
Wiktor Stribiżew answered Nov 15 '22 15:11