I use regexps to transform text as I want, but I want to preserve the HTML tags.
e.g. if I want to replace "stack overflow" with "stack underflow", this should work as
expected: if the input is stack <sometag>overflow</sometag>
, I must obtain stack <sometag>underflow</sometag>
(i.e. the string substitution is done, but the
tags are still there...
Use a DOM library, not regular expressions, when dealing with manipulating HTML:
Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/.
Out of these I would recommend lxml, html5lib, and BeautifulSoup.
Beautiful Soup or HTMLParser is your answer.
Note that arbitrary replacements can't be done unambiguously. Consider the following examples:
HTML:
A<tag>B</tag>
Pattern -> replacement:
AB -> AXB
Possible results:
AX<tag>B</tag>
A<tag>XB</tag>
HTML:
A<tag>A</tag>A
Pattern -> replacement:
A+ -> WXYZ
Possible results:
W<tag />XYZ
W<tag>X</tag>YZ
W<tag>XY</tag>Z
W<tag>XYZ</tag>
WX<tag />YZ
WX<tag>Y</tag>Z
WX<tag>YZ</tag>
WXY<tag />Z
WXY<tag>Z</tag>
WXYZ
What kind of algorithms work for your case depends highly on the nature of possible search patterns and desired rules for handling ambiguity.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With