I am trying to load an HTML page and output its text. Even though I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding.
Source:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup
url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)
encodedText = r.text.encode("utf-8")
soup = BeautifulSoup(encodedText)
text = str(soup.findAll(text=True))
print text.decode("utf-8")
Excerpt of the output:
...Odenw\xc3\xa4lderisch...
this should be Odenwälderisch
You are making two mistakes; you are mis-handling encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.
First of all, don't use response.text! It is not BeautifulSoup at fault here; you are re-encoding a Mojibake. The requests library will default to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.
See the Encoding section of the Advanced documentation:
The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.
Emphasis mine.
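You can see this default in action by inspecting the response before decoding anything. A small sketch; the commented values are assumptions about what this particular server sends:
import requests
r = requests.get("http://www.columbia.edu/~fdc/utf8/")
print r.headers.get('content-type')   # e.g. 'text/html' with no charset parameter
print r.encoding                      # 'ISO-8859-1' when no charset was declared
print r.apparent_encoding             # requests' chardet-based guess, e.g. 'utf-8'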
Pass in the response.content raw data instead:
soup = BeautifulSoup(r.content)
I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or from statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it into BeautifulSoup from the response, but do test first whether requests used a default:
encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
parser = 'html.parser'  # or 'lxml' or 'html5lib'
soup = BeautifulSoup(r.content, parser, from_encoding=encoding)
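If you want to verify which encoding BeautifulSoup actually settled on, bs4 records it on the soup object; the example value is just an assumption:
print soup.original_encoding   # e.g. 'utf-8'; the result of bs4's Unicode, Dammit detection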
Last but not least, with BeautifulSoup 4 you can extract all text from a page using soup.get_text():
text = soup.get_text()
print text
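Putting the pieces together, a corrected version of the script from the question could look like this. This is a sketch assembled from the steps above; the choice of html.parser is just one option:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)
# only trust r.encoding when the server actually declared a charset
encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
soup = BeautifulSoup(r.content, 'html.parser', from_encoding=encoding)
print soup.get_text()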
You are instead converting a result list (the return value of soup.findAll()) to a string. That can never work, because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
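You can see this effect in isolation (Python 2, using the word from the question):
# -*- coding: utf-8 -*-
word = 'Odenwälderisch'   # a UTF-8 encoded byte string in Python 2
print word                # Odenwälderisch -- the raw bytes go straight to the terminal
print [word]              # ['Odenw\xc3\xa4lderisch'] -- lists show repr() of each element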
It's not BeautifulSoup's fault. You can see this by printing out encodedText before you ever use BeautifulSoup: the non-ASCII characters are already gibberish.
The problem here is that you are mixing up bytes and characters. For a good overview of the difference, read one of Joel Spolsky's articles on Unicode, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
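In Python 2 terms, a tiny sketch of that distinction:
# -*- coding: utf-8 -*-
chars = u'Odenwälderisch'             # a unicode string: characters
data = chars.encode('utf-8')          # encoding: characters -> bytes
print repr(data)                      # 'Odenw\xc3\xa4lderisch'
print data.decode('utf-8') == chars   # decoding: bytes -> characters; prints True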
A look at the requests documentation shows that r.text is made of characters, not bytes. You shouldn't be encoding it. If you try to do so, you will make a byte string, and when you try to treat that as characters, bad things will happen.
There are two ways to get around this:
1. Get the raw bytes from r.content, as Martijn suggested. Then you can decode them yourself to turn them into characters.
2. Let requests do the decoding, but just make sure it uses the right codec. Since you know that's UTF-8 in this case, you can set r.encoding = 'utf-8'. If you do this before you access r.text, then when you do access r.text, it will have been properly decoded, and you get a character string (see the sketch after this list). You don't need to mess with character encodings at all.
Incidentally, Python 3 makes it somewhat easier to maintain the difference between character strings and byte strings, because it requires you to use different types of objects to represent them.
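A sketch of the second option, using the URL from the question:
import requests
r = requests.get("http://www.columbia.edu/~fdc/utf8/")
r.encoding = 'utf-8'   # tell requests the correct codec before touching r.text
text = r.text          # now decoded to a proper character string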