HTML encoding and lxml parsing

I'm trying to finally solve some encoding issues that pop up when scraping HTML with lxml. Here are three sample HTML documents that I've encountered:

<!DOCTYPE html>
<html lang='en'>
<head>
   <title>Unicode Chars: 은 —’</title>
   <meta charset='utf-8'>
</head>
<body></body>
</html>

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
    <title>Unicode Chars: 은 —’</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>

My basic script:

from lxml.html import fromstring
...

doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print title

The results are:

Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’

So there is obviously an issue with sample 1, which lacks the <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from here will correctly recognize sample 1 as utf-8 and so it is functionally equivalent to my original code.
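The garbled title for sample 1 is consistent with UTF-8 bytes being read as ISO-8859-1 (Latin-1). The following is a minimal Python 3 sketch using only the standard library. It assumes the parser fell back to Latin-1 when it did not recognize the declared charset, which is a guess based on the output rather than something confirmed by the lxml docs.

```python
# Reproduce the mojibake from sample 1 without lxml.
title = "Unicode Chars: 은 —’"

# The page is stored as UTF-8 bytes ...
raw = title.encode("utf-8")

# ... but is decoded as Latin-1 (assumed fallback encoding).
garbled = raw.decode("latin-1")

# Latin-1 maps some of those bytes to invisible control characters,
# so keep only the printable ones to mimic what a terminal shows.
visible = "".join(ch for ch in garbled if ch.isprintable())
print(visible)
```

Each multi-byte UTF-8 character turns into one visible Latin-1 character followed by control characters, which is why only "ì" and "â" remain visible.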

The lxml docs appear conflicted:

The example here seems to suggest we should use UnicodeDammit to decode the markup to unicode.

from BeautifulSoup import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

root = lxml.html.fromstring(decode_html(tag_soup))

However here it says:

[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.

If I try to follow the first suggestion in the lxml docs, my code is now:

from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print title

I now get the following results:

Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.

Sample 1 now works correctly, but sample 3 results in an error due to the <?xml version="1.0" encoding="utf-8"?> declaration.

Is there a correct way to handle all of these cases? Is there a better solution than the following?

dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)
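For comparison, another way to avoid the ValueError might be to remove the XML declaration from the decoded string before parsing it. This is only a sketch in Python 3: strip_xml_declaration is a hypothetical helper, not part of lxml or BeautifulSoup, and the regular expression only handles a declaration at the very start of the markup.

```python
import re

# Hypothetical helper (not part of lxml or BeautifulSoup): drop a leading
# XML declaration such as <?xml version="1.0" encoding="utf-8"?> from
# markup that has already been decoded to a unicode string.
_XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>\s*')


def strip_xml_declaration(markup):
    """Return markup without a leading XML declaration, if there is one."""
    return _XML_DECLARATION.sub('', markup, count=1)


sample = '<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html>\n<html></html>'
print(strip_xml_declaration(sample))
```

The cleaned string could then be passed to fromstring in place of dammit.unicode_markup. Markup without a declaration is returned unchanged.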

asked Mar 08 '13 19:03

bismark

1 Answer

lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:

#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit

for filename in glob.glob('*.html'):
    # Read raw bytes so lxml never receives an already-decoded string.
    with open(filename, 'rb') as file:
        content = file.read()
        # Let UnicodeDammit detect the encoding from the bytes.
        doc = UnicodeDammit(content, is_html=True)

    # Pass the detected encoding to the parser explicitly.
    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)

Output

Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’
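To illustrate what "specifying the character encoding explicitly" relies on, here is a rough standard-library sketch in Python 3 that looks for a declared encoding in the raw bytes. sniff_declared_encoding is a hypothetical helper, and it is far less thorough than UnicodeDammit: it checks only the XML declaration and <meta> tags in the first 1024 bytes and does not validate the result.

```python
import re

# Hypothetical helper: find an encoding declared near the start of a
# document, either in an XML declaration or in a <meta> tag.
_XML_DECL = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
_META = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def sniff_declared_encoding(raw, default=None):
    """Return the declared encoding name in lowercase, or default."""
    head = raw[:1024]  # declarations are expected near the top
    for pattern in (_XML_DECL, _META):
        match = pattern.search(head)
        if match:
            return match.group(1).decode('ascii').lower()
    return default


print(sniff_declared_encoding(b"<head><meta charset='utf-8'></head>"))
```

The returned name could be passed as the encoding argument of html.HTMLParser, in the same place where the code above uses doc.original_encoding.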

answered Nov 08 '22 10:11

jfs

Tags: python, unicode, beautifulsoup, web-scraping, lxml