
Python regex findall

Tags:

python

regex

I am trying to extract all occurrences of tagged words from a string using a regex in Python 2.7.2. Put simply, I want to extract every piece of text between the [P] and [/P] tags. Here is my attempt:

    import re

    regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
    line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
    person = re.findall(regex, line)

Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']

What is the correct regex to get ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]'] or ['Barack Obama', 'Bill Gates']?

Ignatius asked Oct 13 '11 10:10



1 Answer

    import re

    regex = ur"\[P\] (.+?) \[/P\]+?"
    line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
    person = re.findall(regex, line)
    print(person)

yields

['Barack Obama', 'Bill Gates'] 

The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode string as u'[[1P].+?[/P]]+?', only harder to read.
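The equivalence can be checked directly. This is a minimal sketch that assumes Python 3, where the ur prefix no longer exists, so an ordinary string literal (which also decodes \u escapes) stands in for it. The variable names are illustrative only.

```python
# Assumption: Python 3. A plain (non-raw) string literal decodes \uXXXX
# escapes, much as the ur"..." literal did in Python 2.
escaped = "[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
plain = "[[1P].+?[/P]]+?"

# \u005B is '[', \u005D is ']' and \u002F is '/', so both spell the same pattern.
print(escaped == plain)  # True
```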

The first bracketed group, [[1P], is a character class: it tells re to match any single one of the characters '[', '1', or 'P'. The second bracketed group, [/P], likewise matches either '/' or 'P', which leaves the final ] as a literal bracket repeated by +?. That's not what you want at all. So,

  • Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
  • To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
  • To return only the words inside the tags, place grouping parentheses around .+?.
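Those three fixes can be sketched as follows. This assumes Python 3 (plain raw strings, no ur prefix) and drops the trailing +? from the pattern, which does not change the result here. It also shows a variant without the capturing group, which returns the tags together with the text, the other output the question asks about.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# Unescaped brackets form a character class: it matches ONE character
# out of the set '[', '1', 'P', not the literal text "[1P]".
print(re.findall(r"[\[1P]", "[P]1"))  # ['[', 'P', '1']

# Escaped brackets are literal; the parentheses make findall return
# only the captured text between the tags.
names = re.findall(r"\[P\] (.+?) \[/P\]", line)
print(names)  # ['Barack Obama', 'Bill Gates']

# Without a capturing group, findall returns each whole match, tags included.
tagged = re.findall(r"\[P\].+?\[/P\]", line)
print(tagged)  # ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
```

The lazy quantifier .+? matters in both patterns: a greedy .+ would run from the first [P] to the last [/P] and return a single match.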
unutbu answered Sep 25 '22 15:09