By default lxml doesn't understand the wbr tag, which is used to add word-breaks in long words. It formats it as <wbr></wbr> when it should be formatted simply as <wbr>, similar to the br tag.
How do I add this behavior to lxml?
Actually it is not difficult to patch libxml2 (this walkthrough was done on Ubuntu 11.04 with Python 2.7.3).
First define a test program wbr_test.py:
from lxml import etree
from cStringIO import StringIO
wbr_html = """\
<html>
<head>
<title>wbr test</title>
</head>
<body>
Test for a breakable<wbr>word implementation change
</body>
</html>
"""
parser = etree.HTMLParser()
tree = etree.parse(StringIO(wbr_html), parser)
result = etree.tostring(tree.getroot(),
                        pretty_print=True, method="html")
if result.split() != wbr_html.split():  # split, as we are not interested in whitespace differences
    print(result)
    print("not ok")
else:
    print("OK")
Make sure that it fails by running python wbr_test.py. It should insert a </wbr> before </body>, and print not ok at the end.
Download, extract and compile libxml2:
wget ftp://xmlsoft.org/libxml2/libxml2-2.8.0.tar.gz
tar xvf libxml2-2.8.0.tar.gz
cd libxml2-2.8.0/
./configure --prefix=/usr
make -j8 # adjust number to match your number of cores
Install the library, then install the Python libxml2 bindings:
sudo make install
cd to_python_bindings
sudo python setup.py install
Run wbr_test.py once more, to make sure it still fails with the latest libxml2 version.
First make a copy of HTMLparser.c, e.g. in /var/tmp.
Now edit the file HTMLparser.c at the top level of the libxml2 source. Search for the word forced (there is only one occurrence). You will be at the <br> tag definition. Copy the three lines starting with the line you just found. The most appropriate insert point is just before the end of the table (after the definition of <var>). To get the final comma right in the table, insert the three lines before the line containing just '}', not the one containing '};'.
In the newly inserted code, replace br with wbr and change DECL clear_attrs to NULL (assuming that a new tag does not have deprecated attributes).
The result should differ from the copy in /var/tmp (diff -u /var/tmp/HTMLparser.c HTMLparser.c) as follows:
@@ -1039,6 +1039,9 @@
},
{ "var", 0, 0, 0, 0, 0, 0, 1, "instance of a variable or program argument",
DECL html_inline, NULL, DECL html_attrs, NULL, NULL
+},
+{ "wbr", 0, 2, 2, 1, 0, 0, 1, "possible line break ",
+ EMPTY , NULL , DECL core_attrs, NULL , NULL
}
};
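If you would rather script the edit than do it by hand, something along these lines might work. This is only a sketch: the function name add_wbr_entry is made up for illustration, the field values are taken from the diff above, and it assumes the "var" entry starts with the literal text { "var" and has no nested braces, which you should check against your copy of HTMLparser.c.

```python
# Entry to add. The values mirror the diff above: the "br" entry with the
# name changed to "wbr" and "DECL clear_attrs" replaced by NULL.
WBR_ENTRY = (
    '{ "wbr", 0, 2, 2, 1, 0, 0, 1, "possible line break ",\n'
    '\tEMPTY , NULL , DECL core_attrs, NULL , NULL\n'
    '}'
)


def add_wbr_entry(source):
    """Return the C source text with a "wbr" entry added after the "var" entry.

    Hypothetical helper: it assumes the "var" entry begins with '{ "var"'
    and that the first '}' after it closes that entry.
    """
    if '{ "wbr"' in source:
        return source  # already patched, nothing to do
    start = source.find('{ "var"')
    if start == -1:
        raise ValueError('no "var" entry found')
    end = source.find('}', start)
    if end == -1:
        raise ValueError('the "var" entry is not closed')
    # Add the comma after the "var" entry, then the new entry; the new
    # entry becomes the last one, so it gets no trailing comma.
    return source[:end + 1] + ',\n' + WBR_ENTRY + source[end + 1:]
```

To use it, read HTMLparser.c into a string, pass it through add_wbr_entry, write the result back, and compare it with the copy in /var/tmp before rebuilding.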
Make and install:
make && sudo make install
Run wbr_test.py once more. It should now print OK.
Since <wbr> only exists in HTML5, I suspect the Right Thing to do is use lxml.html.html5parser.
Short of that, the list of empty tags is defined in regular Python code, so you could always just monkeypatch it; see lxml.html.defs.empty_tags. Patches are welcome, I'm sure. :)
As a quick fix, why not use the replace method of strings to remove the close tags?
>>> t = 'Thisisa<wbr></wbr>test'
>>> t.replace('</wbr>', '')
'Thisisa<wbr>test'
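If the close tag might show up in other spellings (upper case, or with whitespace before the >), a regular expression is a little more forgiving than a plain replace. This is a rough sketch and the helper name strip_wbr_close_tags is just for illustration. It works on the serialized text only, so it would also strip a literal </wbr> that appears inside a script or comment.

```python
import re

# Matches </wbr>, </WBR> and variants with whitespace before the ">".
_WBR_CLOSE = re.compile(r'</wbr\s*>', re.IGNORECASE)


def strip_wbr_close_tags(html):
    """Remove every closing wbr tag from an already serialized HTML string."""
    return _WBR_CLOSE.sub('', html)


print(strip_wbr_close_tags('Thisisa<wbr></wbr>test'))
```

You would apply this to the string returned by etree.tostring, after serialization.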