Encoding in python with lxml - complex solution

Tags:

lxml

I need to download and parse webpage with lxml and build UTF-8 xml output. I think schema in pseudocode is more illustrative:

from lxml import etree

webfile = urllib2.urlopen(url)
root = etree.parse(webfile.read(), parser=etree.HTMLParser(recover=True))

txt = my_process_text(etree.tostring(root.xpath('/html/body'), encoding=utf8))


output = etree.Element("out")
output.text = txt

outputfile.write(etree.tostring(output, encoding=utf8))

So webfile can be in any encoding (lxml should handle this). Outputfile have to be in utf-8. I'm not sure where to use encoding/coding. Is this schema ok? (I cant find good tutorial about lxml and encoding, but I can find many problems with this...) I need robust solution.

Edit:

So for sending utf-8 to lxml I use

        converted = UnicodeDammit(webfile, isHTML=True)
        if not converted.unicode:
            print "ERR. UnicodeDammit failed to detect encoding, tried [%s]", \
                ', '.join(converted.triedEncodings)
            continue
        webfile = converted.unicode.encode('utf-8')

802

asked Apr 21 '10 21:04

Vojta Rylko

2 Answers

lxml can be a little wonky about input encodings. It is best to send UTF8 in and get UTF8 out.

You might want to use the chardet module or UnicodeDammit to decode the actual data.

You'd want to do something vaguely like:

import chardet
from lxml import html
content = urllib2.urlopen(url).read()
encoding = chardet.detect(content)['encoding']
if encoding != 'utf-8':
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)

I'm not sure why you are moving between lxml and etree, unless you are interacting with another library that already uses etree?

answered Sep 19 '22 07:09

Ian Bicking

lxml encoding detection is weak.

However, note that the most common problem with web pages is the lack of (or the existence of incorrect) encoding declarations. It is therefore often sufficient to only use the encoding detection of BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.

I recommend to detect encoding using UnicodeDammit and parse using lxml. Also, you can use http header Content-Type (you need to extract charset=ENCODING_NAME) to detect encoding more precisely.

For this example i'm using BeautifulSoup4 (also you have to install chardet for better autodetection, because UnicodeDammit uses chardet internally):

from bs4 import UnicodeDammit

if http_charset == "":
    ud = UnicodeDammit(content, is_html=True)
else:
    ud = UnicodeDammit(content, override_encodings=[http_charset], is_html=True)
root = lxml.html.fromstring(ud.unicode_markup)

OR, to make previous answer more complete, you can modify it to:

if ud.original_encoding != 'utf-8':
    content = content.decode(ud.original_encoding, 'replace').encode('utf-8')

Why this is better than simple using chardet?

You do not ignore Content-Type HTTP header

Content-Type:text/html; charset=utf-8
You do not ignore http-equiv meta tag. Example:

... http-equiv="Content-Type" content="text/html; charset=UTF-8" ...
On top of this, you are using power of chardet, cjkcodecs and iconvcodec codecs and many more.

answered Sep 21 '22 07:09

artyomboyko

Related questions
                            
                                How to create a dictionary using a single list?
                            
                                What's the most space-efficient way to compress serialized Python data?
                            
                                Tensorflow 2: how to switch execution from GPU to CPU and back?
                            
                                RuntimeError: __class__ not set defining 'AbstractBaseUser' as <class 'django.contrib.auth.base_user.Abstract BaseUser'>. Was __classcell__ propagated
                            
                                Maintained alternatives to PyPDF2
                            
                                Setup django with WSGI and apache
                            
                                Nginx + fastcgi truncation problem
                            
                                Python regex findall numbers and dots
                            
                                How do I get the full XML or HTML content of an element using ElementTree?
                            
                                How can I pass a filename as a parameter into my module?
                            
                                str.format() -> how to left-justify
                            
                                Crunching json with python
                            
                                How to explicitly specify a path to Firefox for Selenium?
                            
                                Python - importing package classes into console global namespace
                            
                                Python SSH / SFTP Module?
                            
                                dynamically adding functions to a Python module
                            
                                Python: data vs. text?
                            
                                Using object id as a hash for objects in Python
                            
                                Find system folder locations in Python
                            
                                Python: Class factory using user input as class names

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With