Using python to edit html, but lxml converts nice html entities to strange encoding

I'm trying to use python (with pyquery and lxml) to alter and clean up some html.

Eg. html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>"

The lxml.html.clean function, clean_html(), works well, except that it replaces the nice html entities like

&#146; 

with some unicode string

\xc2\x92

The result looks strange in different browsers (Firefox and Opera, using auto-detected encoding, UTF-8, Latin-1, etc.): it shows up as an empty box. How can I stop lxml from converting the entities? How can I get it all in Latin-1 encoding? It seems strange that a module built specifically for HTML would do this.

I can't be sure of which characters are there, so I can't just use

replace("\xc2\x92","&#146;").

I've tried using

clean_html(html).encode('latin-1')

but the unicode persists.

And yes, I'd tell people to stop using word to write html, but then I'd hear the whole

"iz th wayz i liks it u cant mak me chang hitlr".

Edit: a BeautifulSoup solution:

from BeautifulSoup import BeautifulSoup, Comment

soup = BeautifulSoup(str(desc[desc_type]))
# Collect every HTML comment node and remove it from the tree
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()
print soup

s hanley asked Feb 02 '11 16:02


3 Answers

There are a few things that - if you know them - will lead to the easiest/best solution:

  • clean_html() returns the same type you provide it with: if you give it a string, it will return a string, but if you give it an Element or ElementTree, it will return an Element or ElementTree respectively

  • you can control how an Element or ElementTree is serialized by passing an encoding option to the lxml.html.tostring() function or to the tree's write() method (the same goes for XML, by the way), for example encoding='utf-8'.

  • any content that can be represented in the chosen encoding is output as encoded bytes; anything that cannot is "escaped" as numeric entities. Using encoding="ascii" therefore forces every non-ASCII character into a "nice" entity, as you wish.

Put together, this means: first parse the string into an element (or tree if you wish), clean it, and serialize it as needed:

import lxml.html
from lxml.html.clean import clean_html

html = lxml.html.fromstring("<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>")
html = clean_html(html)
result = lxml.html.tostring(html, encoding="ascii")

(and a slightly dirtier trick is to use the errors parameter on the encode() method of a unicode string: try encoding a unicode string containing "special" characters with s.encode('ascii', 'xmlcharrefreplace') and see what that does...)
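
To make that hint concrete, here is a minimal Python 3 sketch (the rest of this thread is Python 2) that uses only the standard library. The sample string is made up for illustration; it contains both U+2019 and the control character U+0092 that &#146; parses to.

```python
# Python 3 sketch: str.encode() with the "xmlcharrefreplace" error handler
# replaces every character the target codec cannot represent with a
# numeric character reference.
text = "It\u2019s a spicy meatball! \u0092"

ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)

# Latin-1 can represent U+0092 directly but not U+2019, so only the
# latter is escaped here.
latin1_bytes = text.encode("latin-1", "xmlcharrefreplace")
print(latin1_bytes)
```

With ASCII as the target, every non-ASCII character becomes a reference; with Latin-1, only characters outside Latin-1 do.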

Steven answered Nov 15 '22 17:11


I assume &#146; is supposed to be a right single quotation mark (a curly apostrophe). In Python 2, the str with byte value 146, chr(146), decoded as cp1252 gives exactly that character:

In [46]: print(chr(146).decode('cp1252'))
’
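
For readers on Python 3, a rough equivalent of that check is sketched below, using only the standard library. It also shows, as far as I can tell, where the \xc2\x92 in the question comes from: reading &#146; literally as code point 146 and then encoding the result as UTF-8.

```python
# Python 3 sketch of the same check. In Python 3, chr() returns text,
# so the byte has to be built with bytes([...]) before decoding.
raw = bytes([146])              # the single byte 0x92
quote = raw.decode("cp1252")    # Windows-1252 maps 0x92 to U+2019
print(ascii(quote), ord(quote))

# Reading &#146; literally as code point 146 gives the C1 control
# character U+0092, whose UTF-8 encoding is the two bytes from the question.
control = chr(146)
utf8_bytes = control.encode("utf-8")
print(utf8_bytes)
```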

So, you could do this:

import lxml.html.clean as clean
import re

html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>"

# Replace each numeric entity with the character cp1252 assigns to that byte
html = re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))).decode('cp1252'), html)
print(html)
# <div><!-- word style><bleep><omgz 1,000 tags><--><p>It’s a spicy meatball!</div>
print(type(html))
# <type 'unicode'>
print(clean.clean_html(html))
# <div><p>It’s a spicy meatball!</p></div>

Or,

import lxml.html as lh

doc = lh.fromstring(html)
clean.clean(doc)  # cleans doc in place

Note that the quotation mark has unicode code point value 8217. That is, ord(chr(146).decode('cp1252')) equals 8217, so lh.tostring returns:

print(lh.tostring(doc))
# <div><p>It&#8217;s a spicy meatball!</p></div>   

You could re-encode it in cp1252 like this:

print(repr(lh.tostring(doc,encoding='cp1252')))
# '<div><p>It\x92s a spicy meatball!</p></div>'

I don't know how to coax lxml to return

'<div><p>It&#146;s a spicy meatball!</p></div>'

to match the output of your BeautifulSoup code, however. Well, clearly it could be done with regex (reversing what I did above), but I don't know if that is necessary or advisable, since lxml should already be returning html that other applications can understand.

result = re.sub(
    r'&#(\d+);',
    lambda m: '&#{n};'.format(n=ord(unichr(int(m.group(1))).encode('cp1252'))),
    lh.tostring(doc))
print(result)
# <div><p>It&#146;s a spicy meatball!</p></div>
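
The two regex substitutions above rely on Python 2's chr()/unichr() and str.decode(). A possible Python 3 port, operating on plain strings with the standard library only, might look like the sketch below. The function names refs_to_cp1252_text and refs_to_cp1252_numbers are hypothetical, and restricting the first function to the 128-159 range is my own assumption about which references should be reinterpreted.

```python
import re


def refs_to_cp1252_text(html):
    """Replace numeric references in the 128-159 range with the
    character Windows-1252 assigns to that byte value."""
    def repl(m):
        n = int(m.group(1))
        if 128 <= n <= 159:
            try:
                return bytes([n]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # byte undefined in cp1252 (e.g. 129); leave as is
        return m.group(0)
    return re.sub(r"&#(\d+);", repl, html)


def refs_to_cp1252_numbers(html):
    """Rewrite references such as &#8217; to the cp1252 byte value
    (&#146;) when the character exists in Windows-1252."""
    def repl(m):
        n = int(m.group(1))
        try:
            encoded = chr(n).encode("cp1252")
        except (UnicodeEncodeError, ValueError, OverflowError):
            return m.group(0)  # not representable in cp1252; keep as is
        return "&#{};".format(encoded[0])
    return re.sub(r"&#(\d+);", repl, html)


print(refs_to_cp1252_numbers("<div><p>It&#8217;s a spicy meatball!</p></div>"))
```

Unlike the one-liners above, both helpers leave a reference untouched when it has no cp1252 counterpart instead of raising an exception.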

unutbu answered Nov 15 '22 18:11


You could also just convert the UTF-8 string into ASCII with XML character references:

result = result.decode('utf-8').encode('ascii', 'xmlcharrefreplace')
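
The same round trip should also work on a Python 3 bytes object. As a small sketch, applied to the exact bytes from the question (the helper name utf8_to_ascii_refs is hypothetical):

```python
def utf8_to_ascii_refs(data):
    """Decode UTF-8 bytes, then re-encode as ASCII, turning every
    non-ASCII character into a numeric character reference."""
    return data.decode("utf-8").encode("ascii", "xmlcharrefreplace")


converted = utf8_to_ascii_refs(b"<div><p>It\xc2\x92s a spicy meatball!</p></div>")
print(converted)
```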

Laurence Rowe answered Nov 15 '22 17:11