lxml unicode characters

Question

I'm new to lxml and Python, and I'm trying to parse an HTML document. When I parse it with the standard XML parser, the characters are written out correctly, but I think the parse itself fails, because I have trouble searching the tree with XPath. With the HTML parser the search works, but the accented characters come out garbled.

Example file being parsed:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>title</title>
</head>
<body>
    <span id="demo">Garbléd charactérs</span>
</body>
</html>

Parsing code:

from lxml import etree

fname = 'output/so-help.html'

# parse
hparser = etree.HTMLParser()
htree   = etree.parse(fname, hparser)

# garbled
htree.write('so-dumpu.html', encoding='utf-8')

# targets
demo_name = htree.xpath("//span[@id='demo']")

# garbled
print 'name: "' + demo_name[0].text

Terminal output:

name: "GarblÃ©d charactÃ©rs

htree.write output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>title</title></head><body>
    <span id="demo">GarblÃ©d charactÃ©rs</span>
</body></html>

Emil M · Accepted Answer

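The problem is that the parser was not told the file is UTF-8, so it decoded the bytes with the wrong encoding, and writing the tree out as UTF-8 then encoded the already-garbled text a second time. What you need is to let the parser decode the data as UTF-8. To see this in your original code, try demo_name[0].text.encode('latin1').decode('utf-8').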
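A minimal sketch of how this kind of garbling can arise, using only the standard library and Python 3 syntax (the code in this thread is Python 2). The variable names are illustrative, and Latin-1 as the wrongly guessed encoding is an assumption based on the output shown in the question.

```python
# Python 3 sketch: UTF-8 bytes decoded with the wrong codec, then re-encoded.
original = "Garbléd charactérs"

# What is stored on disk when the file is saved as UTF-8.
utf8_bytes = original.encode("utf-8")

# Assumption: the parser guessed Latin-1 (ISO-8859-1) instead of UTF-8.
misread = utf8_bytes.decode("latin-1")

# Writing the misread text back out as UTF-8 encodes it a second time.
double_encoded = misread.encode("utf-8")

print(misread)
print(len(utf8_bytes), len(double_encoded))
```

Each accented character turns into two characters after the misread, and the second encoding makes the output longer than the original file content.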

The right way to do it:

from lxml import etree

fname = 'output/so-help.html'

# parse, telling the parser the file is UTF-8
hparser = etree.HTMLParser(encoding='utf-8')
htree   = etree.parse(fname, hparser)

# no longer garbled
htree.write('so-dumpu.html')

# targets
demo_name = htree.xpath("//span[@id='demo']")

# no longer garbled
print 'name: "' + demo_name[0].text
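For readers on Python 3, here is a hedged sketch of the same fix applied to an in-memory document. The markup in `raw` is a shortened stand-in for the file from the question, and the sketch assumes lxml is installed.

```python
from lxml import etree

# Shortened stand-in for the file in the question, held as UTF-8 bytes.
raw = (
    '<html><head><title>title</title></head>'
    '<body><span id="demo">Garbléd charactérs</span></body></html>'
).encode("utf-8")

# Tell the parser the input encoding instead of letting it guess.
parser = etree.HTMLParser(encoding="utf-8")
root = etree.fromstring(raw, parser)

demo_text = root.xpath("//span[@id='demo']")[0].text
print('name: "' + demo_text + '"')
```

Passing bytes together with an explicit parser encoding keeps the decoding decision in one place, rather than relying on what the parser infers from the markup.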

dusan · Answer

Try changing the output encoding:

htree.write('so-dumpu.html', encoding='latin1')

and

print 'name: "' + demo_name[0].text.encode('latin1')
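Why this workaround appears to help can be sketched without lxml: if the text was misread as Latin-1, encoding it back to Latin-1 reproduces the original UTF-8 bytes. This is a Python 3 illustration with made-up variable names, and it repairs the symptom rather than the parse step.

```python
# Mojibake as printed in the question, written with escapes for clarity.
garbled = "Garbl\u00c3\u00a9d charact\u00c3\u00a9rs"

# Latin-1 maps code points 0-255 straight to bytes, so this recovers
# the bytes that were originally in the file.
recovered_bytes = garbled.encode("latin-1")

# Those bytes are valid UTF-8, so decoding them gives the intended text.
repaired = recovered_bytes.decode("utf-8")
print(repaired)
```

The round trip only works when the wrong guess really was Latin-1, since that codec can represent every byte value without loss.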

Xion345 · Answer

I assume your XHTML document is encoded in UTF-8. The issue is that the encoding is not specified in the document itself. By default, browsers and lxml.html assume that HTML documents are encoded in ISO-8859-1, which is why your document is parsed incorrectly. If you open it in your browser, it will also be displayed incorrectly.

You can specify the encoding of your document like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>title</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>

You can also force the encoding used by lxml (just as you can change the encoding used in your browser):

import lxml.html

# read the raw bytes and decode them explicitly as UTF-8
with open(fname, 'rb') as f:
    filecontents = f.read().decode("utf-8")

# parse the already-decoded unicode string
htree = lxml.html.fromstring(filecontents)
print htree.xpath("//span[@id='demo']")[0].text
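In Python 3 the explicit decode step is usually folded into `open()`. A minimal standard-library sketch follows; the temporary file is a stand-in for the question's file, and the resulting string could then be handed to `lxml.html.fromstring` as in the answer above.

```python
import os
import tempfile

# Write a small UTF-8 file to stand in for 'output/so-help.html'.
markup = '<span id="demo">Garbléd charactérs</span>'
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(markup.encode("utf-8"))
    path = tmp.name

# In Python 3, open() decodes while reading when given an encoding.
with open(path, encoding="utf-8") as f:
    filecontents = f.read()

os.remove(path)
print(filecontents)
```

Naming the encoding in `open()` avoids depending on the platform default, which is not UTF-8 everywhere.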

Tags: python, encoding, lxml

ryan
