Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find/replace text in html while preserving html tags/structure

I use regexps to transform text as I want, but I want to preserve the HTML tags. e.g. if I want to replace "stack overflow" with "stack underflow", this should work as expected: if the input is stack <sometag>overflow</sometag>, I must obtain stack <sometag>underflow</sometag> (i.e. the string substitution is done, but the tags are still there...

like image 379
vbfoobar Avatar asked Dec 06 '09 17:12

vbfoobar


3 Answers

Use a DOM library, not regular expressions, when dealing with manipulating HTML:

  • lxml: a parser, document, and HTML serializer. Also can use BeautifulSoup and html5lib for parsing.
  • BeautifulSoup: a parser, document, and HTML serializer.
  • html5lib: a parser. It has a serializer.
  • ElementTree: a document object, and XML serializer
  • cElementTree: a document object implemented as a C extension.
  • HTMLParser: a parser.
  • Genshi: includes a parser, document, and HTML serializer.
  • xml.dom.minidom: a document model built into the standard library, which html5lib can parse to.

Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/.

Out of these I would recommend lxml, html5lib, and BeautifulSoup.

like image 92
meder omuraliev Avatar answered Nov 09 '22 23:11

meder omuraliev


Beautiful Soup or HTMLParser is your answer.

like image 3
duffymo Avatar answered Nov 09 '22 23:11

duffymo


Note that arbitrary replacements can't be done unambiguously. Consider the following examples:

1)

HTML:

A<tag>B</tag>

Pattern -> replacement:

AB -> AXB

Possible results:

AX<tag>B</tag>
A<tag>XB</tag>

2)

HTML:

A<tag>A</tag>A

Pattern -> replacement:

A+ -> WXYZ

Possible results:

W<tag />XYZ
W<tag>X</tag>YZ
W<tag>XY</tag>Z
W<tag>XYZ</tag>
WX<tag />YZ
WX<tag>Y</tag>Z
WX<tag>YZ</tag>
WXY<tag />Z
WXY<tag>Z</tag>
WXYZ

What kind of algorithms work for your case depends highly on the nature of possible search patterns and desired rules for handling ambiguity.

like image 3
akaihola Avatar answered Nov 10 '22 00:11

akaihola