Python regular expression to replace unencoded ampersands in text

Tags: python, regex

I'm working with an upstream system that sometimes sends me text destined for HTML/XML output with ampersands that are unencoded:

str1 = "Stay at this B&B"
str2 = "He’s going to Texas A&M"
str3 = "He’s going to a B&B and then Texas A&M"

I need to replace the unencoded ampersands with &amp;amp; while preserving the ones that are part of character references or are already encoded.

(Fixing the upstream system isn't an option, and since the text sometimes arrives partially encoded, re-encoding the whole string isn't something I can do, either. I'd really just like to fix this nagging issue and get on with my life)

This regular expression is catching it fine, but I'm having trouble figuring out the syntax to do a re.sub:

re.findall("&[^#|amp]", str3)

I'm not sure how to properly substitute the text; I have a feeling it's going to involve re.group but that's a weakness in my regular expression-foo.

Any help is appreciated.

Scott asked Jan 04 '12 17:01


2 Answers

If the ampersand is part of a character entity, it can be any named entity (not just &amp;amp;), a decimal entity, or a hexadecimal entity. This should cover it:

re.sub(r'&(?![A-Za-z]+[0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)',
       r'&amp;', your_string)
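
As a minimal sketch of how this could be wrapped up and checked against a few inputs (the helper name encode_bare_ampersands and the sample strings beyond the question's own are illustrative assumptions, not part of the original answer):

```python
import re

# "&" that does NOT start a named, decimal, or hexadecimal character reference.
BARE_AMPERSAND = re.compile(r'&(?![A-Za-z]+[0-9]*;|#[0-9]+;|#x[0-9a-fA-F]+;)')


def encode_bare_ampersands(text):
    """Replace only the unencoded ampersands with &amp; (hypothetical helper)."""
    return BARE_AMPERSAND.sub('&amp;', text)


samples = [
    "Stay at this B&B",                 # bare ampersand
    "Texas A&amp;M and a B&B",          # partially encoded input
    "caf&eacute; &#233; &#xE9; &",      # named, decimal, hex, then a bare one
]
for sample in samples:
    print(encode_bare_ampersands(sample))
# Stay at this B&amp;B
# Texas A&amp;M and a B&amp;B
# caf&eacute; &#233; &#xE9; &amp;
```

Because the lookahead consumes nothing, only the ampersand itself is replaced and the text after it is left untouched. One caveat: the pattern checks only the shape of a reference, so text such as "AT&T;" would be treated as a named entity and left alone.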
Alan Moore answered Nov 03 '22 16:11


I would suggest using a negative lookahead for this. It will cause the match to fail if the & is followed by #xxxx; (where each x is a digit, exactly four of them) or amp;, so it will only match standalone & characters and replace them with &amp;amp;.

re.sub(r"&(?!#\d{4};|amp;)", "&amp;", your_string)
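
A short sketch of how this narrower pattern behaves, assuming the upstream text only ever contains &amp;amp; and four-digit decimal references (the helper name fix_narrow and the sample strings are made up for illustration):

```python
import re

# "&" not followed by exactly four digits in "#dddd;" form, and not by "amp;".
NARROW = re.compile(r"&(?!#\d{4};|amp;)")


def fix_narrow(text):
    """Encode standalone ampersands using the narrower lookahead (hypothetical helper)."""
    return NARROW.sub("&amp;", text)


print(fix_narrow("He’s going to a B&B and then Texas A&amp;M"))
# He’s going to a B&amp;B and then Texas A&amp;M

print(fix_narrow("It&#8217;s"))   # four-digit reference is preserved
# It&#8217;s

print(fix_narrow("It&#39;s"))     # two-digit reference is NOT recognised
# It&amp;#39;s
```

If the input can contain other references (shorter or longer numeric ones, hexadecimal ones, or named entities such as &amp;lt;), this pattern would double-encode them, so a broader lookahead is the safer choice in that case.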
Andrew Clark answered Nov 03 '22 15:11