I'm trying to use python (with pyquery and lxml) to alter and clean up some html.
E.g. html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>"
The lxml.html.clean function clean_html() works well, except that it replaces the nice html entities like &#146; with some unicode string like \xc2\x92. That character looks strange in different browsers (Firefox and Opera, with auto-detected encoding, utf-8, latin-1, etc.): it shows up as an empty box. How can I stop lxml from converting the entities? Or how can I get it all in latin-1 encoding? It seems strange that a module built specifically for html would do this.
I can't be sure of which characters are there, so I can't just use replace("\xc2\x92", "&#146;").
I've tried using clean_html(html).encode('latin-1'), but the unicode persists.
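A minimal sketch of what I'm running (assuming clean_html from lxml.html.clean; the comments describe the behaviour I'm seeing, not guaranteed output):
from lxml.html.clean import clean_html
html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>"
cleaned = clean_html(html)   # the comment junk is stripped, which is what I want
print(repr(cleaned))         # but &#146; has been turned into a raw character
                             # (the \xc2\x92 bytes as utf-8), not left as an entity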
And yes, I'd tell people to stop using word to write html, but then I'd hear the whole
"iz th wayz i liks it u cant mak me chang hitlr".
Edit: a beautifulsoup solution:
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup(str(desc[desc_type]))
# find and strip all comments (including the Word junk)
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()
print soup
There are a few things that - if you know them - will lead to the easiest/best solution:
- clean_html() returns the same type you give it: if you pass it a string it returns a string, and if you pass it an Element or ElementTree it returns an Element or ElementTree respectively.
- you can control the way an Element or ElementTree is serialized by passing encoding options to lxml.html.tostring() or to the tree's write() method (the same goes for xml, by the way). You can do this with encoding='utf-8', for example.
- any content that CAN be encoded in that encoding will be output as encoded text; any content that cannot will be "escaped" as entities. Using encoding="ascii" will force all non-ascii characters into "nice" entities, like you want.
Put together, this means: first parse the string into an element (or tree if you wish), clean it, and serialize it as needed:
import lxml.html
from lxml.html.clean import clean_html
html = lxml.html.fromstring("<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>")
html = clean_html(html)
result = lxml.html.tostring(html, encoding="ascii")
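For the sample input above, I'd expect the ascii serialization to bring the non-ascii character back as a numeric entity, something like this (the exact output can vary with your lxml/libxml2 version):
print(result)
# <div><p>It&#146;s a spicy meatball!</p></div>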
(And a slightly dirtier trick is to use the errors parameter on the encode() method of a unicode string: try encoding a unicode string containing "special" characters with s.encode('ascii', 'xmlcharrefreplace') and see what that does...)
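For example, here is what xmlcharrefreplace does to a string containing a curly apostrophe (plain Python, nothing lxml-specific):
s = u'It\u2019s a spicy meatball!'
print(repr(s.encode('ascii', 'xmlcharrefreplace')))
# 'It&#8217;s a spicy meatball!'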
I assume &#146; is supposed to be a quotation mark. The str object with byte value 146, chr(146), decoded with cp1252, is a right single quotation mark:
In [46]: print(chr(146).decode('cp1252'))
’
So, you could do this:
import lxml.html as lh
import lxml.html.clean as clean
import re
html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>"
# turn numeric character references into the characters their byte values map to in cp1252
html = re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))).decode('cp1252'), html)
print(html)
# <div><!-- word style><bleep><omgz 1,000 tags><--><p>It’s a spicy meatball!</div>
print(type(html))
# <type 'unicode'>
print(clean.clean_html(html))
# <div><p>It’s a spicy meatball!</p></div>
Or, clean the parsed document in place:
doc = lh.fromstring(html)
clean.clean(doc)   # the module-level Cleaner instance; cleans doc in place
Note that the quotation mark now has unicode code point value 8217. That is, ord(chr(146).decode('cp1252')) equals 8217, so lh.tostring returns:
print(lh.tostring(doc))
# <div><p>It&#8217;s a spicy meatball!</p></div>
You could re-encode it in cp1252 like this:
print(repr(lh.tostring(doc, encoding='cp1252')))
# '<div><p>It\x92s a spicy meatball!</p></div>'
I don't know how to coax lxml into returning '<div><p>It&#146;s a spicy meatball!</p></div>' to match the output of your BeautifulSoup code, however. Well, clearly it could be done with a regex (reversing what I did above), but I don't know if that is necessary or advisable, since lxml should already be returning html that other applications can understand.
result = re.sub(r'&#(\d+);',
                lambda m: '&#{n};'.format(
                    n=ord(unichr(int(m.group(1))).encode('cp1252'))),
                lh.tostring(doc))
print(result)
# <div><p>It&#146;s a spicy meatball!</p></div>
You could also just convert a utf-8 encoded string into ascii with xml character references:
result = result.decode('utf-8').encode('ascii', 'xmlcharrefreplace')
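For instance (a sketch; utf8_bytes is just an illustrative name, and the utf-8 bytes here come from asking lh.tostring for an explicit encoding):
utf8_bytes = lh.tostring(doc, encoding='utf-8')
print(repr(utf8_bytes.decode('utf-8').encode('ascii', 'xmlcharrefreplace')))
# something like '<div><p>It&#8217;s a spicy meatball!</p></div>'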