Parse HTML and preserve original content

Tags: python, html, node.js, html-parsing, ruby

I have lots of HTML files. I want to replace some elements, keeping all the other content unchanged. For example, I would like to execute this jQuery expression (or some equivalent of it):

$('.header .title').text('my new content')

on the following HTML document:

<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>

and have the following result:

<div class=header><span class=title>my new content</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>

The problem is, all parsers I’ve tried (Nokogiri, BeautifulSoup, html5lib) serialize it to something like this:

<html>
  <head></head>
  <body>
    <div class=header><span class=title>my new content</span></div>
    <p>1</p><p>2</p>
    <table><tbody><tr><td>1</td></tr></tbody></table>
  </body>
</html>

That is, they add:

1. html, head and body elements
2. closing p tags
3. tbody

Is there a parser that satisfies my needs? It should work in either Node.js, Ruby or Python.

asked Jul 20 '12 09:07

NVI

1 Answer

I highly recommend the pyquery package for Python. It is a jQuery-like interface layered on top of the very reliable lxml package, a Python binding to libxml2.

I believe this comes close to what you want, with a familiar interface:

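from pyquery import PyQuery as pq

html = '''
<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
'''

# Parse the fragment; several top-level elements end up wrapped in one <div>.
doc = pq(html)

# Same selector syntax as jQuery; sets the text of every matching element.
doc('.header .title').text('my new content')
print(doc)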

Output:

<div><div class="header"><span class="title">my new content</span></div>
<p>1</p><p>2
</p><table><tr><td>1</td></tr></table></div>

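The closing p tags can't be helped. lxml keeps only the parsed tree from the original document (elements, attributes and text), not the quirks of how the markup was written, such as omitted end tags or unquoted attribute values. That is also why the output above has quoted attributes and a wrapping div around the fragment. Paragraphs can be written two ways, with or without the closing tag, and lxml chooses the more standard way when serializing. I don't believe you'll find a (bug-free) parser that does better.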
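If the untouched markup has to survive byte for byte, one workaround is to skip serialization altogether: use a parser only to find where the target element's content sits in the source string, then splice the new text into the original. The following is a minimal sketch of that idea using the standard library's html.parser, not something pyquery or lxml provides. The names InnerSpanFinder and set_text are hypothetical, and the sketch only handles a selector of the form ".ancestor .target" with two class names; it is not a general CSS selector engine.

```python
from html import escape
from html.parser import HTMLParser

# Void elements never have an end tag, so they are never pushed on the stack.
VOID = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                  "input", "link", "meta", "source", "track", "wbr"})


class InnerSpanFinder(HTMLParser):
    """Hypothetical helper: find source offsets of the content of
    elements matching `.ancestor_class .target_class`."""

    def __init__(self, source, ancestor_class, target_class):
        super().__init__()
        self.ancestor_class = ancestor_class
        self.target_class = target_class
        # getpos() reports (line, column); lines are split on "\n" only.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.stack = []  # open elements: (tag, classes, inner_start or None)
        self.spans = []  # (start, end) offsets of content to replace
        self.feed(source)
        self.close()

    def _index(self):
        # Offset in the source of the tag currently being handled.
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_startendtag(self, tag, attrs):
        # In HTML a trailing slash does not close a non-void element.
        self.handle_starttag(tag, attrs)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        in_ancestor = any(self.ancestor_class in c for _, c, _ in self.stack)
        in_target = any(s is not None for _, _, s in self.stack)
        inner_start = None
        if self.target_class in classes and in_ancestor and not in_target:
            # Content begins right after the raw text of this start tag.
            inner_start = self._index() + len(self.get_starttag_text())
        self.stack.append((tag, classes, inner_start))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; elements left
        # open above it (e.g. <p> without </p>) are closed implicitly.
        for depth in range(len(self.stack) - 1, -1, -1):
            if self.stack[depth][0] == tag:
                break
        else:
            return  # stray end tag, ignore
        end = self._index()
        for _, _, inner_start in self.stack[depth:]:
            if inner_start is not None:
                self.spans.append((inner_start, end))
        del self.stack[depth:]


def set_text(source, new_text, ancestor_class="header", target_class="title"):
    """Replace the content of every `.ancestor_class .target_class`
    element, leaving every other character of `source` untouched."""
    spans = InnerSpanFinder(source, ancestor_class, target_class).spans
    replacement = escape(new_text, quote=False)
    # Splice from the end so that earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        source = source[:start] + replacement + source[end:]
    return source


if __name__ == "__main__":
    html = (
        "<div class=header><span class=title>Foo</span></div>\n"
        "<p>1<p>2\n"
        "<table><tr><td>1</td></tr></table>\n"
    )
    print(set_text(html, "my new content"), end="")
```

Because only the characters between the matched start tag and its end tag are rewritten, the unquoted attributes, the unclosed p elements and the table without tbody come out exactly as they went in. The trade-off is that the element tracking here is far cruder than a real HTML5 tree builder, so it is best suited to simple, predictable edits like the one in the question.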

answered Sep 18 '22 12:09

bukzor
