
lxml and <wbr> tags

By default, lxml doesn't understand the wbr tag, which is used to mark possible word-break points in long words. It serializes it as <wbr></wbr> when it should be written simply as <wbr>, like the br tag.

How do I add this behavior to lxml?

asked Apr 26 '12 21:04 by bukzor


3 Answers

Actually, it is not difficult to patch libxml2. (This walkthrough was done on Ubuntu 11.04 with Python 2.7.3.)

First define a test program wbr_test.py:

from lxml import etree
from cStringIO import StringIO

wbr_html = """\
<html>
  <head>
    <title>wbr test</title>
  </head>
<body>
  Test for a breakable<wbr>word implementation change
</body>
</html>
"""

parser = etree.HTMLParser()
tree   = etree.parse(StringIO(wbr_html), parser)

result = etree.tostring(tree.getroot(),
                         pretty_print=True, method="html")
# Compare token by token, as we are not interested in whitespace differences.
if result.split() != wbr_html.split():
    print(result)
    print("not ok")
else:
    print("OK")

Make sure that it fails by running python wbr_test.py. The output should contain a </wbr> inserted before </body>, and the script should print not ok at the end.

Download, extract and compile libxml2:

wget ftp://xmlsoft.org/libxml2/libxml2-2.8.0.tar.gz
tar xvf libxml2-2.8.0.tar.gz 
cd libxml2-2.8.0/
./configure --prefix=/usr
make -j8  # adjust number to match your number of cores

Install it, then install the Python bindings for libxml2:

sudo make install
cd python  # the Python bindings directory inside the libxml2 source tree
sudo python setup.py install

Test your wbr_test.py once more, to make sure it fails with the latest libxml2 version.
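
It may be worth confirming at this point which libxml2 lxml is actually using, since rebuilding the system library has no effect if lxml was built against, or ships with, a different copy. A minimal sketch, assuming lxml is importable; LIBXML_VERSION and LIBXML_COMPILED_VERSION are version tuples exposed by lxml.etree:

```python
from lxml import etree

# libxml2 version that lxml is running against right now
runtime = ".".join(str(n) for n in etree.LIBXML_VERSION)
# libxml2 version that lxml was compiled against
compiled = ".".join(str(n) for n in etree.LIBXML_COMPILED_VERSION)

print("libxml2 at runtime:    " + runtime)
print("libxml2 at build time: " + compiled)
```

If the runtime version is not the one you just installed, lxml is probably not loading the freshly built library, and patching HTMLparser.c would not change its output.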

Before changing anything, make a backup copy of HTMLparser.c, e.g. in /var/tmp.

Now edit the file HTMLparser.c at the top level of the libxml2 source. Search for the word forced (there is only one occurrence); this takes you to the <br> tag definition. Copy the three lines starting with the line you just found. The most appropriate insertion point is at the end of the table, after the definition of <var>. To get the final comma right, insert the three lines before the line containing just '}', not the one containing '};'.

In the newly inserted code, replace br with wbr and change DECL clear_attrs to NULL (assuming that a new tag does not have deprecated attributes). You can also update the description string, as the diff does.

Compared with the backup copy in /var/tmp (diff -u /var/tmp/HTMLparser.c HTMLparser.c), the result should differ as follows:

@@ -1039,6 +1039,9 @@
 },
 { "var",   0, 0, 0, 0, 0, 0, 1, "instance of a variable or program argument",
DECL html_inline, NULL, DECL html_attrs, NULL, NULL
+},
+{ "wbr",   0, 2, 2, 1, 0, 0, 1, "possible line break ",
+   EMPTY , NULL , DECL core_attrs, NULL , NULL
 }
 };

Make and install:

make && sudo make install

Run wbr_test.py once more. It should now print OK.

answered Nov 07 '22 20:11 by Anthon


Since <wbr> only exists in HTML5, I suspect the Right Thing to do is use lxml.html.html5parser.

Short of that, the list of empty tags is defined in regular Python code, so you could always just monkeypatch it; see lxml.html.defs.empty_tags. Patches are welcome, I'm sure. :)

answered Nov 07 '22 20:11 by Eevee


As a quick fix, why not use the replace method of strings to remove the close tags?

>>> t = 'Thisisa<wbr></wbr>test'
>>> t.replace('</wbr>', '')
'Thisisa<wbr>test'
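
The same idea can be wrapped in a small helper that also tolerates upper-case tags and stray whitespace. This is only a sketch: strip_wbr_close is a hypothetical name, and it assumes the serialized HTML is a text string (not bytes) in which </wbr> never appears as literal content you want to keep.

```python
import re

# Matches a closing wbr tag, ignoring case and whitespace before '>'.
_WBR_CLOSE = re.compile(r"</wbr\s*>", re.IGNORECASE)


def strip_wbr_close(html):
    """Return html with every closing wbr tag removed (hypothetical helper)."""
    return _WBR_CLOSE.sub("", html)


print(strip_wbr_close("Thisisa<wbr></wbr>test"))
```

You would apply it to the string returned by the serializer, after serialization and before writing the output anywhere.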
answered Nov 07 '22 21:11 by Roland Smith