I want to match different parts of a string and store them in separate variables for later use. For example,
string = "bunch(oranges, bananas, apples)"
rxp = "[a-z]*\([var1]\, [var2]\, [var3]\)"
so that I have
var1 = "oranges"
var2 = "bananas"
var3 = "apples"
Something like what re.search() does but for multiple different parts of the same match.
EDIT: the number of fruits in the list is not known beforehand. Should have put this in with the question.
The re.MULTILINE flag tells Python to make the '^' and '$' special characters match at the start and end of every line within a string, not only at the start and end of the whole string. Using this flag: >>> match = re.search(r'^It has.
re.search() and re.match() behave differently. Both return the first match found in the string, but re.match() only matches at the beginning of the string and returns a match object if the pattern matches there, while re.search() scans the whole string.
re.IGNORECASE: This flag allows case-insensitive matching of the regular expression against the given string, i.e. expressions like [A-Z] will match lowercase letters, too. Generally, it's passed as an optional argument to re.compile().
The group() method of a match object returns the complete match by default, a single subgroup when given one argument, or a tuple of subgroups when given several arguments.
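As a small illustration of how group() behaves with different numbers of arguments, here is a sketch using the example string from the question. The names text, m, whole, first and several are placeholders chosen for this example.

```python
import re

text = "bunch(oranges, bananas, apples)"
m = re.search(r"([a-z]+)\(([a-z]+), ([a-z]+), ([a-z]+)\)", text)

whole = m.group()        # no argument: the entire match
first = m.group(2)       # one argument: that subgroup as a string
several = m.group(2, 4)  # several arguments: a tuple of subgroups

print(whole)
print(first)
print(several)
```

Group 0 is always the whole match, and the numbered groups follow the order of the opening parentheses in the pattern.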
That is what re.search does. Just use capturing groups (parentheses) to access the stuff that was matched by certain subpatterns later on:
>>> import re
>>> m = re.search(r"[a-z]*\(([a-z]*), ([a-z]*), ([a-z]*)\)", string)
>>> m.group(0)
'bunch(oranges, bananas, apples)'
>>> m.group(1)
'oranges'
>>> m.group(2)
'bananas'
>>> m.group(3)
'apples'
Also note that I used a raw string to avoid having to double the backslashes.
If the number of "variables" inside bunch can vary, you have a problem. Most regex engines cannot capture a variable number of strings: a repeated capturing group only keeps its last match. However, in that case you could get away with this:
>>> m = re.search(r"[a-z]*\(([a-z, ]*)\)", string)
>>> m.group(1)
'oranges, bananas, apples'
>>> m.group(1).split(', ')
['oranges', 'bananas', 'apples']
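If you need this in more than one place, the search-then-split idea can be wrapped in a small helper. This is only a sketch: the name parse_bunch is made up for this example, and it assumes the items are lowercase words separated by commas, as in the question.

```python
import re

def parse_bunch(text):
    """Return the comma-separated items inside name(...), or [] if nothing matches."""
    m = re.search(r"[a-z]*\(([a-z, ]*)\)", text)
    if m is None:
        return []
    # Split on commas and strip whitespace so "a,b" and "a, b" both work;
    # empty pieces (e.g. from "bunch()") are dropped.
    return [item.strip() for item in m.group(1).split(",") if item.strip()]

print(parse_bunch("bunch(oranges, bananas, apples)"))
print(parse_bunch("bunch(kiwis)"))
```

Returning an empty list when there is no match avoids the AttributeError you would otherwise get from calling group() on None.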
For regular expressions, you can use the match() function to do what you want, and use groups to get your results. Also, don't assign to the name string, as that is the name of a standard-library module, and reusing it invites confusion. For your example, if you know there are always the same number of fruits each time, it looks like this:
import re
text = "bunch(oranges, bananas, apples)"  # "text" avoids shadowing the built-in input()
var1, var2, var3 = re.match(r'bunch\((\w+), (\w+), (\w+)\)', text).group(1, 2, 3)
Here, I used the \w special sequence, which matches any alphanumeric character or underscore, as explained in the documentation.
If you don't know the number of fruits in advance, you can use two regular expression calls: one to extract the minimal part of the string where the fruits are listed, getting rid of "bunch" and the parentheses, then finditer to extract the names of the fruits:
import re
text = "bunch(oranges, bananas, apples)"
# \w+ matches each fruit name on its own, leaving out the ", " separators
[m.group(0) for m in re.finditer(r'\w+', re.match(r'bunch\(([^)]*)\)', text).group(1))]
If you want, you can use groupdict to store matching items in a dictionary:
import re
# The trailing \) keeps the closing parenthesis out of the last group
regex = re.compile(r"[a-z]*\((?P<var1>.*), (?P<var2>.*), (?P<var3>.*)\)")
match = regex.match("bunch(oranges, bananas, apples)")
if match:
    print(match.groupdict())
    # {'var1': 'oranges', 'var2': 'bananas', 'var3': 'apples'}
Don't. Every time you use var1, var2, etc., you actually want a list. Unfortunately, there is no way to collect an arbitrary number of subgroups in a list using findall, but you can use a hack like this:
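import re
string = "bunch(oranges, bananas, apples)"  # the string from the question
lst = []
# The return value of re.sub is discarded; the call is used only for
# the side effect of appending each matched word to lst
re.sub(r'([a-z]+)(?=[^()]*\))', lambda m: lst.append(m.group(1)), string)
print(lst)  # ['oranges', 'bananas', 'apples']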
Note that this works not only for this specific example, but also for any number of substrings.
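For comparison, here is a sketch of the same lookahead idea handed to re.findall directly. It relies on the pattern having no capturing group, in which case findall returns the whole matches; the names text and fruits are placeholders for this example.

```python
import re

text = "bunch(oranges, bananas, apples)"
# A lowercase word that is followed, without crossing another
# parenthesis, by a closing ")". The lookahead consumes nothing,
# so every word inside the parentheses is found separately.
fruits = re.findall(r"[a-z]+(?=[^()]*\))", text)
print(fruits)
```

This avoids the side-effecting callback, at the cost of assuming the parentheses are not nested.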