I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Put simply, I want to extract every piece of text between the [P] and [/P] tags. Here is my attempt:
    import re
    regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
    line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
    person = re.findall(regex, line)
Printing person
produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']
What is the correct regex to get ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
or ['Barack Obama', 'Bill Gates']?
How does findall() work? re.findall(pattern, string) scans the string from left to right and returns all non-overlapping matches of the pattern as a list of strings, in the order they were found. Unlike re.search(), which returns only the first match found anywhere in the string, findall() returns every match. If the pattern contains a single capturing group, each list element is the text captured by that group rather than the whole match.
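To make the findall() behaviour described above concrete, here is a small sketch. The sample string and the patterns are made up for illustration and are not from the question; it only uses re.findall() and re.search() from the standard library.

```python
import re

# Hypothetical sample text, chosen only to illustrate the behaviour.
text = "a1 b22 c333"

# No capturing group: each list element is the whole match.
all_matches = re.findall(r"[a-z]\d+", text)

# One capturing group: each list element is just the captured text.
digits_only = re.findall(r"[a-z](\d+)", text)

# re.search() stops at the first match instead of collecting all of them.
first = re.search(r"[a-z]\d+", text).group(0)

# Matches never overlap: "aaaa" contains two non-overlapping "aa", not three.
pairs = re.findall(r"aa", "aaaa")

print(all_matches)
print(digits_only)
print(first)
print(pairs)
```

The capturing-group rule is the reason a pattern with parentheses around the tag contents returns only the names.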
    import re
    regex = ur"\[P\] (.+?) \[/P\]+?"
    line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
    person = re.findall(regex, line)
    print(person)
yields
['Barack Obama', 'Bill Gates']
The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
is exactly the same unicode as u'[[1P].+?[/P]]+?'
except harder to read.
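You can check that equivalence yourself. The sketch below is an assumption-laden stand-in for the original: it uses a plain u"" literal instead of the question's ur"" prefix (which only exists in Python 2, where \uXXXX escapes are still decoded inside raw unicode strings), so that it runs on both Python 2.7 and Python 3. The variable names are mine.

```python
# The question's pattern, written with a plain unicode literal so the
# \uXXXX escapes are decoded the same way on Python 2.7 and Python 3.
decoded = u"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"

# The same pattern typed out directly.
readable = u"[[1P].+?[/P]]+?"

# Both literals produce the identical string, so re sees the same pattern.
same = (decoded == readable)
print(same)
```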
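The first bracketed group, [[1P], is a character class: it tells re to match any one of the characters '[', '1' or 'P'. The second bracketed group, [/P], likewise matches a single '/' or 'P', and the ]+? after it matches one or more literal ']' characters. That's not what you want at all. So,
- Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
- To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
- To return only the words inside the tags, place grouping parentheses around .+?.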
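Putting those fixes side by side, here is a sketch that runs the broken pattern and two corrected ones against the line from the question. A few assumptions on my part: I use plain r"" strings rather than ur"" so it runs on Python 3 as well as 2.7, I dropped the redundant trailing +?, and in the broken pattern I escaped the leading '[' inside the character class and the trailing ']' (the meaning is unchanged, it just avoids the "possible nested set" FutureWarning that newer Python 3 versions emit). The variable names are only for illustration.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# The question's pattern after decoding: two character classes, not tags.
broken = r"[\[1P].+?[/P]\]+?"
broken_matches = re.findall(broken, line)

# Escaped brackets, no capturing group: whole tagged spans are returned.
with_tags = re.findall(r"\[P\] .+? \[/P\]", line)

# Escaped brackets plus a capturing group: only the text inside the tags.
names = re.findall(r"\[P\] (.+?) \[/P\]", line)

print(broken_matches)  # ['President [P]', '[/P]', '[P] Bill Gates [/P]']
print(with_tags)       # ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
print(names)           # ['Barack Obama', 'Bill Gates']
```

The second and third patterns give the two result shapes asked for in the question; the only difference between them is the parentheses.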