 

How to sub with matched groups and variables in Python

Tags: python, regex

I'm new to Python. This is probably simple, but I haven't found an answer.

rndStr = "20101215"
rndStr2 = "20101216"
str = "Looking at dates between 20110316 and 20110317"
outstr = re.sub("(.+)([0-9]{8})(.+)([0-9]{8})",r'\1'+rndStr+r'\2'+rndStr2,str)

The output I'm looking for is:

Looking at dates between 20101215 and 20101216

But instead I get:

P101215101216

The values of the two rndStr variables don't really matter. Assume they are random or taken from user input (I put static values here to keep it simple). Thanks for any help.

Syed H asked Mar 16 '11 21:03



2 Answers

Your backreferences are ambiguous. Your replacement string becomes

\120101215\220101216

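which no longer reads as groups 1 and 2 followed by literal digits: Python parses \120 and \220 as three-digit octal escapes (\120 is the character P, which is where the P in your output comes from) :)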

To solve it, use this syntax:

r'\g<1>'+rndStr+r'\g<2>'+rndStr2 
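A minimal sketch of the difference between the two templates, assuming Python 3. The names template_bad and template_good are only for illustration and are not from the question:

```python
import re

rnd1 = "20101215"
rnd2 = "20101216"
s = "Looking at dates between 20110316 and 20110317"
pattern = r"(.+)[0-9]{8}(.+)[0-9]{8}"

# Ambiguous: the template text is \120101215\220101216, and \120 / \220
# are parsed as three-digit octal escapes, not as groups 1 and 2.
template_bad = r"\1" + rnd1 + r"\2" + rnd2
bad = re.sub(pattern, template_bad, s)

# Unambiguous: a \g<1> reference ends at the closing angle bracket,
# so the digits that follow are plain text.
template_good = r"\g<1>" + rnd1 + r"\g<2>" + rnd2
good = re.sub(pattern, template_good, s)

print(repr(bad))
print(good)
```

Printing repr(bad) makes the non-printable character produced by \220 visible, which a plain print would hide.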

You also have too many sets of parentheses (or "brackets" if you speak British English like me:) - you don't need parentheses around the [0-9]{8} parts which you're not backreferencing:

re.sub("(.+)[0-9]{8}(.+)[0-9]{8}",...

should be sufficient.

(And, as noted elsewhere, don't use str as a variable name. Unless you want to spend ages debugging why str.replace() doesn't work anymore. Not that I ever did that once... noooo. :)

so the whole thing becomes:

import re
rndStr = "20101215"
rndStr2 = "20101216"
s = "Looking at dates between 20110316 and 20110317"
outstr = re.sub("(.+)[0-9]{8}(.+)[0-9]{8}", r'\g<1>'+rndStr+r'\g<2>'+rndStr2, s)
print(outstr)  # print() with parentheses works on Python 2 and 3

Producing:

Looking at dates between 20101215 and 20101216
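Since the question says the replacement values may come from user input, one related point: a backslash inside the inserted text would itself be parsed as an escape in the replacement template. A sketch of one way around this, assuming Python 3, is to pass a function to sub() instead of a template string. The helper name swap_dates is hypothetical:

```python
import re

def swap_dates(text, first, second):
    """Replace the two 8-digit runs in text with first and second."""
    pattern = re.compile(r"(.+)[0-9]{8}(.+)[0-9]{8}")

    def build(match):
        # The returned string is inserted literally; no escapes are processed.
        return match.group(1) + first + match.group(2) + second

    return pattern.sub(build, text)

s = "Looking at dates between 20110316 and 20110317"
print(swap_dates(s, "20101215", "20101216"))
# A value containing a backslash is inserted unchanged.
print(swap_dates(s, r"a\1b", "20101216"))
```

This trades the compact template for a small function, but the inserted values never pass through the template parser.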
Martin Thompson answered Sep 19 '22 12:09


Notice that if you change the value of rndStr or rndStr2 to text (like 'abc') rather than digits, you get something closer to the expected result.

In your call to re.sub you have r'\1'+rndStr+... This concatenates into the template \120101215..., so the regex engine no longer sees \1 followed by literal digits; the digits are absorbed into the escape, which is probably not what you intended...

You can use named back references to make the back reference unambiguous:

import re

rep1 = "20101215"
rep2 = "20101216"
st = "Looking at dates between 20110316 and 20110317"

# fp and lp name the text before and between the two dates
print(re.sub(r'(?P<fp>.+)[0-9]{8}(?P<lp>.+)[0-9]{8}',
             r'\g<fp>'+rep1+r'\g<lp>'+rep2, st))

Better still, use an easier-to-understand syntax and check the result of the attempted match:

m = re.search(r'(?P<fp>.+)[0-9]{8}(?P<lp>.+)[0-9]{8}', st)
if m:
    print(m.group('fp')+rep1+m.group('lp')+rep2)  # you could use m.group(1) too
else:
    print("no match...")

In either case, your desired string of Looking at dates between 20101215 and 20101216 is produced.

The Python docs on named backreferences:

(?P<name>...)

Similar to regular parentheses, but the substring matched by the group is accessible within the rest of the regular expression via the symbolic group name 'name'. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named. So the group named 'id' in the example below can also be referenced as the numbered group 1.

For example, if the pattern is (?P<id>[a-zA-Z_]\w*), the group can be referenced by its name in arguments to methods of match objects, such as m.group('id') or m.end('id'), and also by name in the regular expression itself (using (?P=id)) and replacement text given to .sub() (using \g<id>).
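The three ways of referring to a named group mentioned in that passage can be sketched as follows, assuming Python 3. The sample strings and the second pattern (the group named word) are made up for illustration:

```python
import re

# By name (or number) on the match object.
pattern = re.compile(r"(?P<id>[a-zA-Z_]\w*)")
m = pattern.search("   foo42 = 1")
print(m.group("id"), m.group(1), m.end("id"))

# By name inside the pattern itself: (?P=word) matches the same text again.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

# By name in the replacement text: \g<word> inserts the captured text.
collapsed = doubled.sub(r"\g<word>", "the the cat")
print(collapsed)
```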

dawg answered Sep 22 '22 12:09