Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BeautifulSoup Prettify fails on copyright symbol

I am getting a Unicode error: UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 822: character maps to <undefined>

This appears to be a standard copyright symbol, and in the HTML is &copy. I have not been able to find a way past this. I even tried a custom function to replace copy with a space but that also failed with the same error.

import sys
import pprint
import mechanize
import cookielib
from bs4 import BeautifulSoup
import html2text
import lxml

def MakePretty():

def ChangeCopy(S):
    return S.replace(chr(169)," ")
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
#br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# The site we will navigate into, handling its session
# Open the site
br.open('http://www.thesitewizard.com/faqs/copyright-symbol.shtml')
html = br.response().read()
soup = BeautifulSoup(html)
print soup.prettify()

if __name__ == '__main__':
    MakePretty()

How do I get prettify past the copyright symbol? I have searched all over the web for a solution to no avail (or I might not understand as I am fairly new to Python and scraping).

Thanks for your help.

like image 661
user1521411 Avatar asked Jul 12 '12 17:07

user1521411


People also ask

What is the use of soup prettify?

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string: Python3.

What is beautifulsoup4 in Python?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Which of the following objects of BeautifulSoup is not editable?

BeautifulSoupD. ParserCorrect Option : BEXPLANATION : You cannot edit the Navigable String object but can convert it into a Unicode stringusing the function Unicode.


2 Answers

I had the same problem. This may work for you:

print soup.prettify().encode('UTF-8')

like image 105
kmote Avatar answered Oct 19 '22 02:10

kmote


The page http://www.thesitewizard.com/faqs/copyright-symbol.shtml is sent without specifying character encoding. The page itself specifies the encoding as ISO-8859-1 in a meta tag, but only after the occurrence of the “©” character. So clients have to make a guess, and the guess may be wrong. If the client guesses UTF-8, then it will see the bit A9, which is a data error in UTF-8 data.

So it seems that you need to set the encoding (to ISO-8859-1, or more safely to windows-1252) when reading the data. This is of course an ad hoc solution only; it makes no sense to fix the encoding in general.

like image 39
Jukka K. Korpela Avatar answered Oct 19 '22 02:10

Jukka K. Korpela