
How to give priority to one regex pattern over another

Tags: python, regex

I am using regular expressions to extract university names. Mainly two patterns are observed.

  1. "some name" university --> ex: Anna University
  2. university of "something" --> ex: University of Exeter

For this, I have written two patterns:

regex = re.compile('|'.join([r'[Uu]niversity of (\w+){1,3}',r'(?:\S+\s){1,3}\S*[uU]niversity']))

But in a few cases I am not getting the expected answer. For example,

sentence  = "Biology Department University of Vienna"

Applying the above regular expression to this sentence, I get

"Biology Department University"

which is wrong. I think that, since both patterns can match, the second pattern is the one that matches and its phrase is extracted.

I need to give priority to the first pattern, so that "University of something" is extracted in such cases.

Can anybody help?

Bhimasen asked Dec 07 '16 06:12


1 Answer

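In general, alternations in regular expressions are evaluated from left to right, so the leftmost alternative is tried first and has priority. You already did that, though. The reason you still got the match from the right side of the alternation is that it can match earlier in the string: the engine tries each starting position in turn, and at the start of the sentence ("Biology ...") only the second alternative succeeds, so the search stops there before it ever reaches "University of Vienna".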
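To see where each alternative starts matching, here is a small sketch that runs the two patterns from the question separately and then combined. The names `of_pattern` and `name_pattern` are labels introduced here for illustration only; they do not appear in the question.

```python
import re

sentence = "Biology Department University of Vienna"

# The two alternatives from the question, split apart for inspection.
# (of_pattern / name_pattern are illustrative names, not from the question.)
of_pattern = r'[Uu]niversity of (\w+){1,3}'
name_pattern = r'(?:\S+\s){1,3}\S*[uU]niversity'

first = re.search(of_pattern, sentence)
second = re.search(name_pattern, sentence)
combined = re.search('|'.join([of_pattern, name_pattern]), sentence)

print(first.start(), first.group())    # 19 University of Vienna
print(second.start(), second.group())  # 0 Biology Department University
print(combined.group())                # Biology Department University
```

The combined search returns the match that starts at index 0, even though the "University of" alternative is listed first, because alternative order only matters among matches that begin at the same position.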

Therefore you need to be more specific and allow a "Foo University" match only if no "of" follows. You can use a negative lookahead assertion for this:

import re

# (?!\s+of\b) rejects "... University" when "of" follows; re.I replaces [Uu].
regex = re.compile('|'.join([r'university of (\w+){1,3}',
                             r'(?:\S+\s){1,3}\S*university(?!\s+of\b)']),
                   flags=re.I)
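
As a quick check, the following sketch applies that regex to the sentence from the question and to the two example names given there. The `samples` and `results` names are introduced here only for the demonstration.

```python
import re

regex = re.compile('|'.join([r'university of (\w+){1,3}',
                             r'(?:\S+\s){1,3}\S*university(?!\s+of\b)']),
                   flags=re.I)

# Inputs taken from the question; `samples` and `results` are illustrative names.
samples = [
    "Biology Department University of Vienna",
    "Anna University",
    "University of Exeter",
]

# Every sample matches, so .group(0) is safe to call here.
results = [regex.search(s).group(0) for s in samples]
for s, r in zip(samples, results):
    print(f"{s!r} -> {r!r}")
```

With the lookahead in place, the first sentence yields "University of Vienna". Note that the second alternative still accepts up to three preceding words, so in longer sentences it may pull in words that are not part of the university name.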
Tim Pietzcker answered Sep 30 '22 20:09