Python regular expression to replace unencoded ampersands in text

Question

I'm working with an upstream system that sometimes sends me text destined for HTML/XML output with ampersands that are unencoded:

str1 = "Stay at this B&B"
str2 = "He&#8217;s going to Texas A&M"
str3 = "He&#8217;s going to a B&amp;B and then Texas A&M"

I need to replace the unencoded ampersands with &amp; while preserving the ones that are part of character references or are already encoded.

(Fixing the upstream system isn't an option, and since the text sometimes arrives partially encoded, re-encoding the whole string isn't something I can do, either. I'd really just like to fix this nagging issue and get on with my life.)

This regular expression is catching it fine, but I'm having trouble figuring out the syntax to do a re.sub:

re.findall("&[^#|amp]", str3)

I'm not sure how to properly substitute the text; I have a feeling it's going to involve re.group but that's a weakness in my regular expression-foo.

Any help is appreciated.

Alan Moore · Accepted Answer

If the ampersand is part of a character entity, it can be any named entity (not just &amp;), a decimal entity, or a hexadecimal entity. This should cover it:

re.sub(r'&(?![A-Za-z]+[0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)',
       r'&amp;', your_string)
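
As a quick check, the pattern above can be run against the three sample strings from the question. This is only a minimal sketch: the names BARE_AMP and escape_bare_ampersands are made up for illustration and are not part of the original answer.

```python
import re

# "&" followed by anything that does NOT look like a named, decimal,
# or hexadecimal character reference (the pattern from the answer above).
BARE_AMP = re.compile(r'&(?![A-Za-z]+[0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)')


def escape_bare_ampersands(text):
    """Hypothetical helper: encode only the ampersands that start no reference."""
    return BARE_AMP.sub('&amp;', text)


str1 = "Stay at this B&B"
str2 = "He&#8217;s going to Texas A&M"
str3 = "He&#8217;s going to a B&amp;B and then Texas A&M"

for s in (str1, str2, str3):
    print(escape_bare_ampersands(s))
# Stay at this B&amp;B
# He&#8217;s going to Texas A&amp;M
# He&#8217;s going to a B&amp;B and then Texas A&amp;M
```

Note that the lookahead only checks the shape of what follows, not whether the name is a real entity, so text such as "Q&A; more" would be left alone because "&A;" looks like a named reference.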

Andrew Clark · Answer

I would suggest using a negative lookahead for this. It will cause the match to fail if the & is followed by #xxxx; (where each x is a digit) or amp;, so it will only match standalone & characters and replace them with &amp;.

re.sub(r"&(?!#\d{4};|amp;)", "&amp;", your_string)
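
A small sketch of how this pattern behaves may help when deciding whether it is broad enough. The sample strings beyond those in the question (the three-digit reference and the &lt; entity) are assumptions added for illustration, as are the variable names.

```python
import re

# The pattern from this answer: skips only four-digit decimal references and "&amp;".
narrow = re.compile(r"&(?!#\d{4};|amp;)")

samples = [
    "Texas A&M",    # bare ampersand: should be encoded
    "He&#8217;s",   # four-digit decimal reference: left alone
    "B&amp;B",      # already encoded: left alone
    "&#169; 2011",  # three-digit reference (illustrative input)
    "1 &lt; 2",     # other named entity (illustrative input)
]

results = {s: narrow.sub("&amp;", s) for s in samples}
for original, fixed in results.items():
    print(f"{original!r} -> {fixed!r}")
```

With these inputs the first three come out as wanted, but the last two are double-encoded (to "&amp;#169; 2011" and "1 &amp;lt; 2"), because \d{4} requires exactly four digits and amp is the only named entity listed. If the upstream text only ever contains four-digit references and &amp;, that may be acceptable.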

Tags:

python

regex

Scott
