How do I remove html entities (and more) using lxml?

Tags: python, html-parsing, lxml

I've got an HTML file containing text that looks like this (this is the output of etree.tostring(table, pretty_print=True) after running the file through lxml.html parse and lxml.html clean):

<tr><td>&#13;
224&#13;
9:00 am&#13;
-3:00 pm&#13;
NPHC Leadership</td>&#13;
<td>&#13;
<font>ALSO IN 223; WALL OPEN</font></td>&#13;

The documentation that I've found on lxml has been somewhat spotty. I've been able to do quite a bit to get to this point, but what I would like to do is strip out all the tags except <table>, <td>, and <tr>. I would also like to strip all the attributes from those tags, and to get rid of the entities, such as &#13;.

To strip the attributes currently I use:

    etree.strip_attributes(tree, 'width', 'href', 'style', 'onchange',
                           'ondblclick', 'class', 'colspan', 'cols',
                           'border', 'align', 'color', 'value',
                           'cellpadding', 'nowrap', 'selected',
                           'cellspacing')

which works fine, but it seems like there should be a better way. It seems like there should be some fairly simple methods to do what I want, but I haven't been able to find any examples that worked right for me.

I tried using Cleaner, but when I passed it allow_tags, like this:

    Cleaner(allow_tags=['table', 'td', 'tr']).clean_html(tree)

it gave me this error:

    ValueError: It does not make sense to pass in both allow_tags and remove_unknown_tags

Also, when I add remove_unknown_tags=False I get this error:

Traceback (most recent call last):
  File "parse.py", line 73, in <module>
    SParser('schedule.html').test()
  File "parse.py", line 38, in __init__
    self.clean()
  File "parse.py", line 42, in clean
    Cleaner(allow_tags=['table', 'td', 'tr'], remove_unknown_tags=False).clean_html(tree)
  File "/usr/lib/python2.6/dist-packages/lxml/html/clean.py", line 488, in clean_html
    self(doc)
  File "/usr/lib/python2.6/dist-packages/lxml/html/clean.py", line 390, in __call__
    el.drop_tag()
  File "/usr/lib/python2.6/dist-packages/lxml/html/__init__.py", line 191, in drop_tag
    assert parent is not None
AssertionError

So, to sum up:

1. I want to remove HTML entities, such as &#13;
2. I want to remove all tags except <table>, <tr>, and <td>
3. I want to remove all the attributes from the remaining tags.

Any help would be greatly appreciated!

asked May 03 '11 20:05

Wayne Werner

2 Answers

Here is an example of stripping out all attributes and allowing only the tags in [table, tr, td]. I've added a few Unicode entities for the sake of illustration.

DATA = '''<table border="1"><tr colspan="4"><td rowspan="2">\r
224&#13;
&#8220;hi there&#8221;
9:00 am\r
-3:00 pm&#13;
NPHC Leadership</td>\r
<td rowspan="2">\r
<font>ALSO IN 223; WALL OPEN</font></td>\r
</table>'''

import lxml.html
from lxml.html import clean

def _clean_attrib(node):
    # Recurse into the children first, then drop every attribute on this node.
    for n in node:
        _clean_attrib(n)
    node.attrib.clear()

tree = lxml.html.fromstring(DATA)
cleaner = clean.Cleaner(allow_tags=['table','tr','td'],
                        remove_unknown_tags=False)
cleaner.clean_html(tree)
_clean_attrib(tree)

# encoding='unicode' makes tostring() return str, which print() shows as-is
print(lxml.html.tostring(tree, encoding='unicode', pretty_print=True,
                         method='html'))

Result:

<table><tr>
<td>
224
“hi there”
9:00 am
-3:00 pm
NPHC Leadership</td>
<td>
<font>ALSO IN 223; WALL OPEN</font>
</td>
</tr></table>
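
Note that <font> is still present in the result above. As far as I can tell, clean_html() works on a copy and returns it, so the tag filter never reaches tree itself; assigning the return value (tree = cleaner.clean_html(tree)) should apply it. Recent lxml releases also appear to ship the cleaner separately, as the lxml_html_clean package. Below is a minimal sketch that avoids Cleaner altogether and uses only lxml.html. The names keep_only and ALLOWED are hypothetical, not part of lxml, and the sample markup is made up for illustration:

```python
import lxml.html

ALLOWED = {"table", "tr", "td"}  # hypothetical allow-list, adjust as needed

def keep_only(root, allowed=ALLOWED):
    """Strip every attribute and unwrap every element whose tag is not allowed."""
    # Snapshot the elements first, because drop_tag() modifies the tree.
    for el in list(root.iter()):
        if not isinstance(el.tag, str):
            continue  # comments and processing instructions have no string tag
        el.attrib.clear()
        # The element passed in is never unwrapped, so the caller keeps a valid tree.
        if el is not root and el.tag not in allowed:
            el.drop_tag()  # removes the tag but keeps its text and children
    return root

root = lxml.html.fromstring(
    '<table border="1"><tr><td rowspan="2">224 '
    '<font color="red">ALSO IN 223</font> x</td></tr></table>'
)
keep_only(root)
print(lxml.html.tostring(root, encoding="unicode"))
```

Because only descendants of the element passed in are unwrapped, this sidesteps the AssertionError from the question, which comes from drop_tag() being called on an element without a parent.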

Are you sure you want to strip out all entities? The &#13; corresponds to a carriage return, and when lxml parses the document it converts all entities to their corresponding Unicode characters.
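
A small sketch of what that means in practice: after parsing, the element text holds an actual carriage-return character, not an entity, so "removing the entity" amounts to removing '\r' from text and tail. The helper name strip_cr is hypothetical and the sample markup is made up for illustration:

```python
import lxml.html

el = lxml.html.fromstring("<p>224&#13;\n&#8220;hi there&#8221;</p>")
text_before = el.text
print(repr(text_before))  # the entities have become ordinary characters

def strip_cr(root):
    """Remove carriage returns from the text and tail of every node under root."""
    for node in root.iter():
        if node.text:
            node.text = node.text.replace("\r", "")
        if node.tail:
            node.tail = node.tail.replace("\r", "")
    return root

strip_cr(el)
text_after = el.text
print(repr(text_after))
```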

Whether entities show up is also dependent on the output method and encoding. For example, if you use lxml.html.tostring(encoding='ascii', method='xml') the '\r' and Unicode characters will be output as entities:

<table>
  <tr><td>&#13;
  &#8220;hi there&#8221;
...
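
To compare the two behaviours side by side, here is a short sketch that serializes the same element twice. The sample markup is made up for illustration, and the exact output may vary between lxml/libxml2 versions:

```python
import lxml.html

el = lxml.html.fromstring("<p>&#8220;hi there&#8221;</p>")

# ASCII cannot represent the curly quotes, so they are written as character references.
as_ascii = lxml.html.tostring(el, encoding="ascii", method="xml")

# A Unicode string can hold the characters directly, so no references are needed.
as_text = lxml.html.tostring(el, encoding="unicode", method="xml")

print(as_ascii)
print(as_text)
```

So the entities seen in the question's output are a property of how the tree was serialized, not something stored in the tree.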

answered Sep 28 '22 14:09

samplebias

I find that writing it in terms of the basic elements text, tag and tail makes it much easier to tailor the behaviour to what you want and to include error checking (e.g. to ensure there are no unexpected tags in the incoming data).

The if statements on text and tail are needed because these attributes are None rather than "" when they are empty.

import lxml.html

def ctext(el):
    # Serialize the contents of el (not el's own tag), keeping only table/tr/td tags.
    result = [ ]
    if el.text:
        result.append(el.text)
    for sel in el:
        if sel.tag in ["tr", "td", "table"]:
            result.append("<%s>" % sel.tag)
            result.append(ctext(sel))
            result.append("</%s>" % sel.tag)
        else:
            # Any other tag is dropped, but its text content is kept.
            result.append(ctext(sel))
        if sel.tail:
            result.append(sel.tail)
    return "".join(result)

html = """your input string"""
el = lxml.html.fromstring(html)
print(ctext(el))

Remember the relationship is:

  <b>text of the bold <i>text of the italic</i> tail of the italic</b>
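
That relationship can be checked directly. The sketch below uses the standard library's xml.etree.ElementTree so that it runs without third-party packages; lxml elements expose the same .text and .tail attributes:

```python
import xml.etree.ElementTree as ET

b = ET.fromstring(
    "<b>text of the bold <i>text of the italic</i> tail of the italic</b>"
)
i = b[0]  # the nested <i> element

print(repr(b.text))  # text before the first child of <b>
print(repr(i.text))  # text inside <i>
print(repr(i.tail))  # text after </i>, up to the next tag
print(repr(b.tail))  # nothing follows </b>, so this is None
```

The trailing text belongs to the child (i.tail), not to the parent, which is why ctext appends sel.tail after handling each child.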

answered Sep 28 '22 15:09

Julian Todd
