I'm trying to use python (with pyquery and lxml) to alter and clean up some html.
E.g. html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>"
The lxml.html.clean function clean_html() works well, except that it replaces the nice html entities like &#146; with some unicode string like \xc2\x92. That character looks strange in different browsers (Firefox and Opera, with auto-detected encoding, utf-8, latin-1, etc.): it shows up as an empty box. How can I stop lxml from converting the entities? Or how can I get it all in latin-1 encoding? It seems strange that a module built specifically for html would do this.
I can't be sure of which characters are there, so I can't just use replace("\xc2\x92", "&#146;").
I've tried using clean_html(html).encode('latin-1'), but the unicode persists.
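A minimal sketch of what I'm running (assuming clean_html from lxml.html.clean; the comments describe the behaviour I'm seeing, not guaranteed output):
from lxml.html.clean import clean_html
html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>"
cleaned = clean_html(html)   # the comment junk is stripped, which is what I want
print(repr(cleaned))         # but &#146; has been turned into a raw character
                             # (the \xc2\x92 bytes as utf-8), not left as an entity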
And yes, I'd tell people to stop using word to write html, but then I'd hear the whole
"iz th wayz i liks it u cant mak me chang hitlr".
Edit: a beautifulsoup solution:
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup(str(desc[desc_type]))
# find and strip all comments (including the Word junk)
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()
print soup
There are a few things that - if you know them - will lead to the easiest/best solution:
- clean_html() returns the same type you give it: if you pass it a string it returns a string, and if you pass it an Element or ElementTree it returns an Element or ElementTree respectively.
- you can control the way an Element or ElementTree is serialized by passing encoding options to lxml.html.tostring() or to the tree's write() method (the same goes for xml, by the way). You can do this with encoding='utf-8', for example.
- any content that CAN be encoded in that encoding will be output as encoded text; any content that cannot will be "escaped" as entities. Using encoding="ascii" will force all non-ascii characters into "nice" entities, like you want.
Put together, this means: first parse the string into an element (or tree if you wish), clean it, and serialize it as needed:
import lxml.html
from lxml.html.clean import clean_html
html = lxml.html.fromstring("<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>")
html = clean_html(html)
result = lxml.html.tostring(html, encoding="ascii")
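For the sample input above, I'd expect the ascii serialization to bring the non-ascii character back as a numeric entity, something like this (the exact output can vary with your lxml/libxml2 version):
print(result)
# <div><p>It&#146;s a spicy meatball!</p></div>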
(And a slightly dirtier trick is to use the errors parameter on the encode() method of a unicode string: try encoding a unicode string containing "special" characters with s.encode('ascii', 'xmlcharrefreplace') and see what that does...)
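For example, here is what xmlcharrefreplace does to a string containing a curly apostrophe (plain Python, nothing lxml-specific):
s = u'It\u2019s a spicy meatball!'
print(repr(s.encode('ascii', 'xmlcharrefreplace')))
# 'It&#8217;s a spicy meatball!'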
I assume &#146; is supposed to be a quotation mark. The str object with byte value 146, chr(146), decoded with cp1252, is a right single quotation mark:
In [46]: print(chr(146).decode('cp1252'))
’
So, you could do this:
import lxml.html as lh
import lxml.html.clean as clean
import re
html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>"
# turn numeric character references into the characters their byte values map to in cp1252
html = re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))).decode('cp1252'), html)
print(html)
# <div><!-- word style><bleep><omgz 1,000 tags><--><p>It’s a spicy meatball!</div>
print(type(html))
# <type 'unicode'>
print(clean.clean_html(html))
# <div><p>It’s a spicy meatball!</p></div>
Or, clean the parsed document in place:
doc = lh.fromstring(html)
clean.clean(doc)   # the module-level Cleaner instance; cleans doc in place
Note that the quotation mark now has unicode code point value 8217. That is, ord(chr(146).decode('cp1252')) equals 8217, so lh.tostring returns:
print(lh.tostring(doc))
# <div><p>It&#8217;s a spicy meatball!</p></div>
You could re-encode it in cp1252 like this:
print(repr(lh.tostring(doc, encoding='cp1252')))
# '<div><p>It\x92s a spicy meatball!</p></div>'
I don't know how to coax lxml into returning '<div><p>It&#146;s a spicy meatball!</p></div>' to match the output of your BeautifulSoup code, however. Well, clearly it could be done with a regex (reversing what I did above), but I don't know if that is necessary or advisable, since lxml should already be returning html that other applications can understand.
result = re.sub(r'&#(\d+);',
                lambda m: '&#{n};'.format(
                    n=ord(unichr(int(m.group(1))).encode('cp1252'))),
                lh.tostring(doc))
print(result)
# <div><p>It&#146;s a spicy meatball!</p></div>
You could also just convert a utf-8 encoded string into ascii with xml character references:
result = result.decode('utf-8').encode('ascii', 'xmlcharrefreplace')
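For instance (a sketch; utf8_bytes is just an illustrative name, and the utf-8 bytes here come from asking lh.tostring for an explicit encoding):
utf8_bytes = lh.tostring(doc, encoding='utf-8')
print(repr(utf8_bytes.decode('utf-8').encode('ascii', 'xmlcharrefreplace')))
# something like '<div><p>It&#8217;s a spicy meatball!</p></div>'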