
Python re: Storing multiple matches in variables

Tags: python, regex

I want to match different parts of a string and store them in separate variables for later use. For example,

string = "bunch(oranges, bananas, apples)"
rxp = "[a-z]*\([var1]\, [var2]\, [var3]\)"

so that I have

var1 = "oranges"
var2 = "bananas"
var3 = "apples"

Something like what re.search() does but for multiple different parts of the same match.

EDIT: the number of fruits in the list is not known beforehand. Should have put this in with the question.

Arish asked Nov 18 '12 21:11



4 Answers

That is what re.search does. Just use capturing groups (parentheses) to access the stuff that was matched by certain subpatterns later on:

>>> import re
>>> m = re.search(r"[a-z]*\(([a-z]*), ([a-z]*), ([a-z]*)\)", string)
>>> m.group(0)
'bunch(oranges, bananas, apples)'
>>> m.group(1)
'oranges'
>>> m.group(2)
'bananas'
>>> m.group(3)
'apples'

Also note that I used a raw string to avoid doubling the backslashes.
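As a minimal sketch of the fixed-count case: groups() returns every capture as one tuple, which can be unpacked directly. The variable name text and the None fallback are assumptions added for illustration, not part of the answer above.

```python
import re

text = "bunch(oranges, bananas, apples)"  # assumed sample input

# Three capturing groups, one per fruit (a fixed count is assumed).
m = re.search(r"[a-z]*\(([a-z]*), ([a-z]*), ([a-z]*)\)", text)

if m is not None:
    # groups() returns all captures as a tuple, ready for unpacking.
    var1, var2, var3 = m.groups()
else:
    # re.search returns None when nothing matches, so guard before unpacking.
    var1 = var2 = var3 = None
```

Checking for None first avoids an AttributeError on input that does not have the expected shape.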

If the number of "variables" inside bunch can vary, you have a problem. Most regex engines cannot capture a variable number of strings: a repeated group keeps only its last match. However, in that case you could get away with this:

>>> m = re.search(r"[a-z]*\(([a-z, ]*)\)", string)
>>> m.group(1)
'oranges, bananas, apples'
>>> m.group(1).split(', ')
['oranges', 'bananas', 'apples']
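To see why the split is needed, here is a small sketch (names such as last_only and fruits are illustrative assumptions) that puts the two approaches side by side: a group inside a repetition, and a single group around the whole list.

```python
import re

text = "bunch(oranges, bananas, apples)"  # assumed sample input

# A group inside a repetition matches every item in turn,
# but only the capture from the final iteration is kept.
repeated = re.search(r"\((?:([a-z]+)(?:, )?)+\)", text)
last_only = repeated.group(1)

# Capturing the whole list once and splitting it afterwards keeps every item.
whole = re.search(r"\(([a-z, ]*)\)", text)
fruits = whole.group(1).split(", ")
```

Here last_only holds just the last fruit, while fruits holds all three.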

Martin Ender answered Sep 29 '22 18:09


For regular expressions, you can use the match() function to do what you want, and use groups to get your results. Also, avoid naming a variable string: it is the name of a standard-library module, and the variable hides that module if you ever import it. (The same goes for input, which is a built-in function, so this example uses text.) For your example, if you know there are always the same number of fruits each time, it looks like this:

import re
text = "bunch(oranges, bananas, apples)"
var1, var2, var3 = re.match(r'bunch\((\w+), (\w+), (\w+)\)', text).group(1, 2, 3)

Here, I used the \w special sequence, which matches any alphanumeric character or underscore, as explained in the documentation.

If you don't know the number of fruits in advance, you can use two regular expression calls: one to extract the minimal part of the string where the fruits are listed, getting rid of "bunch" and the parentheses, and then finditer to extract the names of the fruits:

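import re
text = "bunch(oranges, bananas, apples)"
# Step 1: isolate the text between the parentheses; step 2: pull out each word.
[m.group(0) for m in re.finditer(r'\w+', re.match(r'bunch\(([^)]*)\)', text).group(1))]
# ['oranges', 'bananas', 'apples']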
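The two-step idea can also be wrapped in a small helper. This is only a sketch: extract_items is a hypothetical name, and it uses re.findall for the second step and returns an empty list when the outer pattern does not match.

```python
import re

def extract_items(text):
    """Return the words listed inside bunch(...), or [] if nothing matches.

    extract_items is a hypothetical helper name used for illustration.
    """
    # Step 1: isolate whatever sits between the parentheses.
    outer = re.match(r"bunch\(([^)]*)\)", text)
    if outer is None:
        return []
    # Step 2: pull out each run of word characters as one item.
    return re.findall(r"\w+", outer.group(1))

items = extract_items("bunch(oranges, bananas, apples)")
```

Because the second step looks only for word characters, the separators never end up in the results, whatever spacing surrounds the commas.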

acjay answered Sep 29 '22 17:09


If you want, you can use groupdict to store matching items in a dictionary:

import re

regex = re.compile(r"[a-z]*\((?P<var1>.*), (?P<var2>.*), (?P<var3>.*)\)")
match = regex.match("bunch(oranges, bananas, apples)")
if match:
    print(match.groupdict())

# {'var1': 'oranges', 'var2': 'bananas', 'var3': 'apples'}
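A slightly stricter sketch of the same idea, assuming each item is a single word: using \w+ in place of .* keeps a group from swallowing commas or the closing parenthesis. The names pattern, fruit_by_name, and second are illustrative assumptions.

```python
import re

# Named groups restricted to word characters (assumes one word per item).
pattern = re.compile(r"[a-z]*\((?P<var1>\w+), (?P<var2>\w+), (?P<var3>\w+)\)")
match = pattern.match("bunch(oranges, bananas, apples)")

# groupdict() maps each group name to its captured text.
fruit_by_name = match.groupdict() if match else {}

# A single named group can also be read directly by name.
second = match.group("var2") if match else None
```

Named groups still fix the number of items in advance, so this suits the three-fruit case rather than a list of unknown length.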

tehmisvh answered Sep 29 '22 18:09


Don't. Every time you use var1, var2 etc, you actually want a list. Unfortunately, a single match cannot hand back an arbitrary number of subgroups as a list, because a repeated group keeps only its last capture, but you can use a hack like this:

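import re
string = "bunch(oranges, bananas, apples)"
lst = []
# re.sub is used only for its side effect: the callback records each match.
re.sub(r'([a-z]+)(?=[^()]*\))', lambda m: lst.append(m.group(1)) or '', string)
print(lst)  # ['oranges', 'bananas', 'apples']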

Note that this works not only for this specific example, but also for any number of substrings.
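For input shaped like the example, the same lookahead pattern can also be handed to re.findall, which returns the matches as a list without a callback. This is a sketch under that assumption; the four-item sample string and the name fruits are made up for illustration.

```python
import re

text = "bunch(oranges, bananas, apples, kiwis)"  # assumed sample input

# Match each lowercase word that is followed, with no parenthesis in
# between, by a closing parenthesis. "bunch" fails the lookahead because
# an opening parenthesis comes right after it.
fruits = re.findall(r"[a-z]+(?=[^()]*\))", text)
```

The lookahead is what keeps the word before the parentheses out of the result, so the pattern works for any number of items.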

georg answered Sep 29 '22 17:09