Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse HTML and preserve original content

I have lots of HTML files. I want to replace some elements, keeping all the other content unchanged. For example, I would like to execute this jQuery expression (or some equivalent of it):

$('.header .title').text('my new content')

on the following HTML document:

<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>

and have the following result:

<div class=header><span class=title>my new content</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>

The problem is, all parsers I’ve tried (Nokogiri, BeautifulSoup, html5lib) serialize it to something like this:

<html>
  <head></head>
  <body>
    <div class=header><span class=title>my new content</span></div>
    <p>1</p><p>2</p>
    <table><tbody><tr><td>1</td></tr></tbody></table>
  </body>
</html>

E.g. they add:

  1. html, head and body elements
  2. closing p tags
  3. tbody

Is there a parser that satisfies my needs? It should work in either Node.js, Ruby or Python.

like image 422
NVI Avatar asked Jul 20 '12 09:07

NVI


People also ask

What does parsing HTML do?

Parsing means analyzing and converting a program into an internal format that a runtime environment can actually run, for example the JavaScript engine inside browsers. The browser parses HTML into a DOM tree. HTML parsing involves tokenization and tree construction.

Can you parse HTML?

HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.


1 Answers

I highly recommend the pyquery package, for python. It is a jquery-like interface layered ontop of the extremely reliable lxml package, a python binding to libxml2.

I believe this does exactly what you want, with a quite familiar interface.

from pyquery import PyQuery as pq
html = '''
<div class=header><span class=title>Foo</span></div>
<p>1<p>2
<table><tr><td>1</td></tr></table>
'''
doc = pq(html)

doc('.header .title').text('my new content')
print doc

Output:

<div><div class="header"><span class="title">my new content</span></div>
<p>1</p><p>2
</p><table><tr><td>1</td></tr></table></div>

The closing p tag can't be helped. lxml only keeps the values from the original document, not the vagaries of the original. Paragraphs can be made two ways, and it chooses the more standard way when doing serialization. I don't believe you'll find a (bug-free) parser that does better.

like image 52
bukzor Avatar answered Sep 18 '22 12:09

bukzor