I'm trying to achieve the following replacement in python. Replace all html tags with {n}
& create a hash of [tag, {n}]
Original string -> "<h>
This is a string. </H><P>
This is another part. </P>
"
Replaced text -> "{0} This is a string. {1}{2} This is another part. {3}"
Here's my code. I've started with replacement but I'm stuck at the replacement logic as I cannot figure out the best way to replace each occurrence in a consecutive manner i.e with {0}, {1} and so on:
import re
text = "<h> This is a string. </H><p> This is another part. </P>"
num_mat = re.findall(r"(?:<(\/*)[a-zA-Z0-9]+>)",text)
print(str(len(num_mat)))
reg = re.compile(r"(?:<(\/*)[a-zA-Z0-9]+>)",re.VERBOSE)
phctr = 0
#for phctr in num_mat:
# phtxt = "{" + str(phctr) + "}"
phtxt = "{" + str(phctr) + "}"
newtext = re.sub(reg,phtxt,text)
print(newtext)
Can someone help with a better way of achieving this? Thank you!
To replace a string in Python, the regex sub() method is used. It is a built-in Python method in re module that returns replaced string. Don't forget to import the re module. This method searches the pattern in the string and then replace it with a new given expression.
The replace() method replace() is a built-in method in Python that replaces all the occurrences of the old character with the new character.
The replace() in Python returns a copy of the string where all occurrences of a substring are replaced with another substring.
The replace() method is a built-in functionality offered in Python programming. It replaces all the occurrences of the old substring with the new substring. Replace() returns a new string in which old substring is replaced with the new substring.
import re
import itertools as it
text = "<h> This is a string. </H><p> This is another part. </P>"
cnt = it.count()
print re.sub(r"</?\w+>", lambda x: '{{{}}}'.format(next(cnt)), text)
prints
{0} This is a string. {1}{2} This is another part. {3}
Works for simple tags only (no attributes/spaces in tags). For extended tags, you have to adapt the regexp.
Also, not reinitializing cnt = it.count()
will keep the numbering going on.
UPDATE to get a mapping dict:
import re
import itertools as it
text = "<h> This is a string. </H><p> This is another part. </P>"
cnt = it.count()
d = {}
def replace(tag, d, cnt):
if tag not in d:
d[tag] = '{{{}}}'.format(next(cnt))
return d[tag]
print re.sub(r"(</?\w+>)", lambda x: replace(x.group(1), d, cnt), text)
print d
prints:
{0} This is a string. {1}{2} This is another part. {3}
{'</P>': '{3}', '<h>': '{0}', '<p>': '{2}', '</H>': '{1}'}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With