I'm new to Python. This is probably simple, but I haven't found an answer.
rndStr = "20101215"
rndStr2 = "20101216"
str = "Looking at dates between 20110316 and 20110317"
outstr = re.sub("(.+)([0-9]{8})(.+)([0-9]{8})",r'\1'+rndStr+r'\2'+rndStr2,str)
The output I'm looking for is:
Looking at dates between 20101215 and 20101216
But instead I get:
P101215101216
The values of the two rndStr variables don't really matter. Assume they're random or taken from user input (I put static values here to keep it simple). Thanks for any help.
The sub() function belongs to Python's regular expression module, re. It returns a string in which all matching occurrences of the specified pattern are replaced by the replacement string.
On a match object, match.group() returns the matched text from the string. This would be a15 in our first example. match.start() and match...
To get access to the text matched by each regex group, pass the group's number to the group(group_number) method. The first group is group(1), the second is group(2), and so on. This is the simple way to access each of the groups, as long as the pattern matched.
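As a small illustration of group numbering, here is a sketch in which the pattern and sample text are made up for the example (they are not from the question). Each parenthesised part of the pattern can be read back by its number once a match is found:

```python
import re

# Hypothetical sample: split an 8-digit date into year, month and day groups.
text = "dates between 20110316 and 20110317"
m = re.search(r"(\d{4})(\d{2})(\d{2})", text)

if m:
    whole = m.group()            # the entire match, same as m.group(0)
    year = m.group(1)            # first parenthesised group
    month = m.group(2)           # second group
    day = m.group(3)             # third group
    span = (m.start(), m.end())  # where the match sits inside text
    print(whole, year, month, day, span)
```

re.search only finds the first match, so the second date in the sample is not touched here.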
Your backreferences are ambiguous. Your replacement string becomes
\120101215\220101216
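which is two rather large numbers to be backreferencing :) (In practice Python reads the first three digits after each backslash, \120 and \220, as octal character escapes, which is where the P in your output comes from.)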
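To see the ambiguity for yourself, this sketch rebuilds the replacement string from the question and prints both it and the result. It assumes a current Python 3, where the three digits after each backslash are taken as an octal character escape, so \120 turns into P and \220 into an unprintable character:

```python
import re

rndStr = "20101215"
rndStr2 = "20101216"
s = "Looking at dates between 20110316 and 20110317"

# The digits of rndStr run straight on from the \1 and \2
replacement = r'\1' + rndStr + r'\2' + rndStr2
print(replacement)

# \120 and \220 are read as octal escapes, not as group 1 and group 2
broken = re.sub("(.+)([0-9]{8})(.+)([0-9]{8})", replacement, s)
print(repr(broken))
```

Printing with repr() makes the otherwise invisible \x90 character show up between the two runs of digits.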
To solve it, use this syntax:
r'\g<1>'+rndStr+r'\g<2>'+rndStr2
You also have too many sets of parentheses (or "brackets" if you speak British English like me :) - you don't need parentheses around the [0-9]{8} parts, which you're not backreferencing:
re.sub("(.+)[0-9]{8}(.+)[0-9]{8}",...
should be sufficient.
(And, as noted elsewhere, don't use str as a variable name. Unless you want to spend ages debugging why str.replace() doesn't work anymore. Not that I ever did that once... noooo. :)
So the whole thing becomes:
import re
rndStr = "20101215"
rndStr2 = "20101216"
s = "Looking at dates between 20110316 and 20110317"
outstr = re.sub("(.+)[0-9]{8}(.+)[0-9]{8}", r'\g<1>'+rndStr+r'\g<2>'+rndStr2, s)
print(outstr)
Producing:
Looking at dates between 20101215 and 20101216
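Since the replacement values may come from user input, another option (not part of the answer above, just a sketch) is to pass a function to re.sub instead of a template string. The function's return value is inserted literally, so digits or backslashes in the values cannot be mistaken for backreferences. The helper name swap_dates is made up for this example:

```python
import re

rndStr = "20101215"
rndStr2 = "20101216"
s = "Looking at dates between 20110316 and 20110317"

def swap_dates(m):
    # m.group(1) is the text before the first date,
    # m.group(2) is the text between the two dates
    return m.group(1) + rndStr + m.group(2) + rndStr2

outstr = re.sub(r"(.+)[0-9]{8}(.+)[0-9]{8}", swap_dates, s)
print(outstr)
```

This trades the compact \g<1> syntax for a few more lines, which may be worth it when the values are not under your control.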
Notice that if you change the value of rndStr or rndStr2 to text (like 'abc') rather than digits, you get something closer to the expected result.
In your call to re.sub you have r'\1'+rndStr+... This combines into r'\1'+'20101215', which then tries to use \120101215 as the backreference, which is probably not what you intended...
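A quick way to check this is to run the original four-group pattern from the question with a non-numeric value. The sketch below uses 'abc' for both values, purely as a stand-in:

```python
import re

s = "Looking at dates between 20110316 and 20110317"
pattern = "(.+)([0-9]{8})(.+)([0-9]{8})"  # the four-group pattern from the question

# With letters after \1 and \2 there is nothing to confuse with the group number
with_text = re.sub(pattern, r'\1' + "abc" + r'\2' + "abc", s)
print(with_text)  # Looking at dates between abc20110316abc
```

The backreferences now resolve, but note that with four groups \2 is the first date rather than the " and " in the middle, so the result is still not quite what was wanted.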
You can use named back references to make the back reference unambiguous:
import re
rep1 = "20101215"
rep2 = "20101216"
st = "Looking at dates between 20110316 and 20110317"
print(re.sub(r'(?P<fp>.+)[0-9]{8}(?P<lp>.+)[0-9]{8}',
             r'\g<fp>'+rep1+r'\g<lp>'+rep2, st))
Better still, use an easier to understand syntax and check the return of the attempted match:
m = re.search(r'(?P<fp>.+)[0-9]{8}(?P<lp>.+)[0-9]{8}', st)
if m:
    print(m.group('fp')+rep1+m.group('lp')+rep2)  # you could use m.group(1) too
else:
    print("no match...")
In either case, your desired string, Looking at dates between 20101215 and 20101216, is produced.
The Python docs on named backreferences:
(?P<name>...)
Similar to regular parentheses, but the substring matched by the group is accessible within the rest of the regular expression via the symbolic group name 'name'. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named. So the group named 'id' in the example below can also be referenced as the numbered group 1.
For example, if the pattern is (?P<id>[a-zA-Z_]\w*), the group can be referenced by its name in arguments to methods of match objects, such as m.group('id') or m.end('id'), and also by name in the regular expression itself (using (?P=id)) and replacement text given to .sub() (using \g<id>).
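The three ways of referring to a named group that the docs mention can be tried out in a few lines. This is only a sketch: the sample strings and the variable names ident and doubled are invented for the example.

```python
import re

ident = re.compile(r"(?P<id>[a-zA-Z_]\w*)")

# 1. By name (or by number) on a match object
m = ident.search("  total_count = 3")
name = m.group('id')       # same text as m.group(1)
end_pos = m.end('id')      # index just past the end of the group

# 2. By name inside the pattern itself, using (?P=id)
doubled = re.compile(r"(?P<id>[a-zA-Z_]\w*) (?P=id)")
repeat = doubled.search("say hello hello again").group()

# 3. By name in the replacement text, using \g<id>
wrapped = ident.sub(r"<\g<id>>", "a b")

print(name, end_pos, repeat, wrapped)
```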