How to find/replace text in html while preserving html tags/structure

Question

I use regexps to transform text as I want, but I want to preserve the HTML tags. e.g. if I want to replace "stack overflow" with "stack underflow", this should work as expected: if the input is stack <sometag>overflow</sometag>, I must obtain stack <sometag>underflow</sometag> (i.e. the string substitution is done, but the tags are still there...

meder omuraliev · Accepted Answer

Use a DOM library, not regular expressions, when dealing with manipulating HTML:

lxml: a parser, document, and HTML serializer. Also can use BeautifulSoup and html5lib for parsing.
BeautifulSoup: a parser, document, and HTML serializer.
html5lib: a parser. It has a serializer.
ElementTree: a document object, and XML serializer
cElementTree: a document object implemented as a C extension.
HTMLParser: a parser.
Genshi: includes a parser, document, and HTML serializer.
xml.dom.minidom: a document model built into the standard library, which html5lib can parse to.

Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/.

Out of these I would recommend lxml, html5lib, and BeautifulSoup.

duffymo · Answer

Beautiful Soup or HTMLParser is your answer.

akaihola · Answer

Note that arbitrary replacements can't be done unambiguously. Consider the following examples:

1)

HTML:

A<tag>B</tag>

Pattern -> replacement:

AB -> AXB

Possible results:

AX<tag>B</tag>
A<tag>XB</tag>

2)

HTML:

A<tag>A</tag>A

Pattern -> replacement:

A+ -> WXYZ

Possible results:

W<tag />XYZ
W<tag>X</tag>YZ
W<tag>XY</tag>Z
W<tag>XYZ</tag>
WX<tag />YZ
WX<tag>Y</tag>Z
WX<tag>YZ</tag>
WXY<tag />Z
WXY<tag>Z</tag>
WXYZ

What kind of algorithms work for your case depends highly on the nature of possible search patterns and desired rules for handling ambiguity.

How to find/replace text in html while preserving html tags/structure

Tags:

python

html

html-parsing

vbfoobar

3 Answers

meder omuraliev

duffymo

1)

2)

akaihola

Recent Activity

Donate For Us

How to find/replace text in html while preserving html tags/structure

Tags:

python

html

html-parsing

vbfoobar

3 Answers

meder omuraliev

duffymo

1)

2)

akaihola

Related questions

Recent Activity

Donate For Us