I'm working with an upstream system that sometimes sends me text destined for HTML/XML output with ampersands that are unencoded:
str1 = "Stay at this B&B"
str2 = "He’s going to Texas A&M"
str3 = "He’s going to a B&B and then Texas A&M"
I need to replace the unencoded ampersands with &amp; while preserving the ones that are part of character references or are already encoded.
(Fixing the upstream system isn't an option, and since the text sometimes arrives partially encoded, re-encoding the whole string isn't something I can do, either. I'd really just like to fix this nagging issue and get on with my life)
This regular expression is catching it fine, but I'm having trouble figuring out the syntax to do a re.sub:
re.findall("&[^#|amp]", str3)
I'm not sure how to properly substitute the text; I have a feeling it's going to involve re.group, but that's a weakness in my regular expression-fu.
Any help is appreciated.
If the ampersand is part of a character entity, it can be any named entity (not just &amp;), a decimal entity, or a hexadecimal entity. This should cover it:
re.sub(r'&(?![A-Za-z]+[0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)',
       r'&amp;', your_string)
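As a rough sketch of how that substitution behaves on the strings from the question, the snippet below wraps the same pattern in a helper. The names `BARE_AMP` and `encode_bare_ampersands` are made up for illustration, and the last sample string is an assumed example of partially encoded input.

```python
import re

# Same pattern as above: match "&" unless it starts a named,
# decimal, or hexadecimal character reference.
BARE_AMP = re.compile(r'&(?![A-Za-z]+[0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)')

def encode_bare_ampersands(text):
    """Replace only the ampersands that do not begin a character reference."""
    return BARE_AMP.sub('&amp;', text)

samples = [
    "Stay at this B&B",
    "He’s going to a B&B and then Texas A&M",
    "Already encoded: B&amp;B, &#38;, &#x26;, &lt;",  # assumed sample input
]
for s in samples:
    print(encode_bare_ampersands(s))
```

Because already-encoded references are skipped, running the helper a second time on its own output should leave the text unchanged.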
I would suggest using a negative lookahead for this. It will cause the match to fail if the & is followed by #xxxx; (where x is a digit) or amp;, so it will only match standalone & characters and replace them with &amp;.
re.sub(r"&(?!#\d{4};|amp;)", "&amp;", your_string)
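One thing to keep in mind with this narrower lookahead: it only recognizes four-digit decimal references and amp;, so other references would be encoded a second time. A small sketch of that behavior follows; the names `NARROW` and `encode_narrow` and the sample strings are hypothetical.

```python
import re

# The lookahead from the line above: skip "&" only before "#dddd;" or "amp;".
NARROW = re.compile(r"&(?!#\d{4};|amp;)")

def encode_narrow(text):
    """Apply the narrower substitution to a string."""
    return NARROW.sub("&amp;", text)

# The bare "&" is encoded; "&amp;" and the four-digit reference are left alone.
print(encode_narrow("B&B &amp; &#8217;"))

# Named and shorter numeric references are not recognized, so they get re-encoded.
print(encode_narrow("&lt; &#38;"))
```

If the upstream text can contain references other than those two forms, a broader lookahead is probably the safer choice.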