Why does my regex pattern not capture the word before the preposition?
My regex pattern is meant to capture proper nouns that are followed by a preposition. For instance:
• Academy of Management --> Academy of
• McGraw Hill Foundation of Books --> Foundation of
For the following text:
test = 'The Academy of Management Entrepreneurship Division and McGraw Hill present the annual award to individuals who develop and implement an innovation in entrepreneurship pedagogy for either graduate or undergraduate education.'
I run:
import re
pp = r'[A-Z][A-Za-z]+\s+\b(for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'
x2 = re.findall(pp, test)
x2
outputs:
['of']
Why doesn't it output 'Academy of'?
A capturing group is a section of a regular expression enclosed in parentheses ( ). Capturing groups are used to extract specific parts of a match. It looks like you've introduced one by accident: the parentheses you use to group the alternatives "for", "of", "in" and "by" also capture whichever one matched.
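As a rough illustration of the difference between the whole match and what a group captures, the sketch below runs the question's pattern with re.search. The shortened string `sample` is made up for this example.

```python
import re

# Shortened, made-up sample; the pattern is the one from the question.
sample = 'The Academy of Management Entrepreneurship Division'
pattern = r'[A-Z][A-Za-z]+\s+\b(for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'

m = re.search(pattern, sample)
whole_match = m.group(0)   # everything the pattern matched
first_group = m.group(1)   # only what the parentheses captured

print(whole_match)
print(first_group)
```

So the pattern itself does match the word before the preposition; it is only the return value of re.findall that is limited to the group.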
When you have exactly one capturing group in your expression (as in your question), re.findall returns a list of what that group captured, not the full match. At the moment, you don't have any group around the first part of your regular expression. If you want to capture it as well, you must enclose it in parentheses too:
pp=r'([A-Z][A-Za-z]+\s+\b(for|of|in|by))\b(?=\s+[A-Z][A-Za-z]+)'
#    ^                                 ^
re.findall(pp,test)
returns:
[('Academy of', 'of')]
Now re.findall returns a list of tuples because the pattern contains more than one capturing group. The elements of each tuple are ordered by where each group's opening parenthesis appears, so the outer group comes first.
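A minimal sketch of that ordering, using the text from the question. The variable names `nested` and `phrases` are only for illustration, and re.finditer is shown as one possible way to pick out a single group by number.

```python
import re

test = ('The Academy of Management Entrepreneurship Division and McGraw Hill '
        'present the annual award to individuals who develop and implement '
        'an innovation in entrepreneurship pedagogy for either graduate or '
        'undergraduate education.')

# The outer group opens first, so it is group 1; the preposition is group 2.
pp = r'([A-Z][A-Za-z]+\s+\b(for|of|in|by))\b(?=\s+[A-Z][A-Za-z]+)'

# With two groups, findall yields one tuple per match: (group 1, group 2).
nested = re.findall(pp, test)
print(nested)

# finditer yields Match objects, so a single group can be chosen by number.
phrases = [m.group(1) for m in re.finditer(pp, test)]
print(phrases)
```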
If you don't want to capture the inner group as well, you can make it non-capturing:
(?:for|of|in|by)
Then the only thing captured is 'Academy of', so re.findall returns ['Academy of']. Since only one capturing group now remains and it wraps the whole match (the lookahead consumes no characters), you can drop the outer parentheses entirely: with no capturing groups, re.findall returns the full match of the regular expression.
pp=r'[A-Z][A-Za-z]+\s+\b(?:for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'
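To check the group-free version, here is a short sketch that runs it on the question's text and on the second example phrase from the question. The names `result` and `extra_result` are assumptions made for this example.

```python
import re

test = ('The Academy of Management Entrepreneurship Division and McGraw Hill '
        'present the annual award to individuals who develop and implement '
        'an innovation in entrepreneurship pedagogy for either graduate or '
        'undergraduate education.')

# No capturing groups at all: (?:...) groups the alternatives without capturing.
pp = r'[A-Z][A-Za-z]+\s+\b(?:for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'

# With no capturing groups, findall returns the full text of each match.
result = re.findall(pp, test)
print(result)

# The second example phrase from the question.
extra_result = re.findall(pp, 'McGraw Hill Foundation of Books')
print(extra_result)
```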
Just put a capture group around the word before the preposition:
pp = r'([A-Z][A-Za-z]+)\s+\b(for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'
Or if you want to capture the whole word/preposition string:
pp = r'([A-Z][A-Za-z]+\s+\b(?:for|of|in|by))\b(?=\s+[A-Z][A-Za-z]+)'
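For comparison, a small sketch that runs both patterns above on the text from the question. The names `separate`, `combined`, `pairs` and `phrases` are chosen here for illustration only.

```python
import re

test = ('The Academy of Management Entrepreneurship Division and McGraw Hill '
        'present the annual award to individuals who develop and implement '
        'an innovation in entrepreneurship pedagogy for either graduate or '
        'undergraduate education.')

# Word and preposition captured as two separate groups.
separate = r'([A-Z][A-Za-z]+)\s+\b(for|of|in|by)\b(?=\s+[A-Z][A-Za-z]+)'
# Word and preposition captured together as one group.
combined = r'([A-Z][A-Za-z]+\s+\b(?:for|of|in|by))\b(?=\s+[A-Z][A-Za-z]+)'

pairs = re.findall(separate, test)     # list of (word, preposition) tuples
phrases = re.findall(combined, test)   # list of "word preposition" strings

print(pairs)
print(phrases)
```

The first form is handy if you need the word and the preposition as separate values; the second if you only need the phrase as a single string.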