utf8 codec can't decode byte 0x96 in python

Tags:

python

I am trying to check if a certain word is on a page for many sites. The script runs fine for say 15 sites and then it stops.

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15344: invalid start byte

I did a search on stackoverflow and found many issues on it but I can't seem to understand what went wrong in my case.

I would like to either solve it or, if there is an error, skip that site. Please advise how I can do this, as I am new and the code below has taken me a day to write. By the way, the site the script halted on was http://www.homestead.com


    filetocheck = open("bloglistforcommenting","r")
    resultfile = open("finalfile","w")

    for countofsites in filetocheck.readlines():
        sitename = countofsites.strip()
        htmlfile = urllib.urlopen(sitename)
        page = htmlfile.read().decode('utf8')
        match = re.search("Enter your name", page)
        if match:
            print "match found  : " + sitename
            resultfile.write(sitename+"\n")
        else:
            print "sorry did not find the pattern " +sitename

    print "Finished Operations"

As per Mark's comments, I changed the code to use BeautifulSoup:


    htmlfile = urllib.urlopen("http://www.homestead.com")
    page = BeautifulSoup((''.join(htmlfile)))
    print page.prettify()

Now I am getting this error:


    page = BeautifulSoup((''.join(htmlfile)))
    TypeError: 'module' object is not callable

I am trying their quick start example from http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick%20Start. If I copy paste it then the code works fine.

I FINALLY got it to work. Thank you all for your help. Here is the final code.


    import urllib
    import re
    from BeautifulSoup import BeautifulSoup

    filetocheck = open("listfile","r")
    resultfile = open("finalfile","w")
    error ="for errors"

    for countofsites in filetocheck.readlines():
        sitename = countofsites.strip()
        htmlfile = urllib.urlopen(sitename)
        page = BeautifulSoup((''.join(htmlfile)))
        pagetwo =str(page)
        match = re.search("Enter YourName", pagetwo)
        if match:
            print "match found  : " + sitename
            resultfile.write(sitename+"\n")
        else:
            print "sorry did not find the pattern " +sitename

    print "Finished Operations"

727

asked Oct 24 '11 09:10

Vishal Khialani

1 Answer

The byte at 15344 is 0x96. Presumably at position 15343 there is either a single-byte encoding of a character, or the last byte of a multiple-byte encoding, making 15344 the start of a character. 0x96 is in binary 10010110, and any byte matching the pattern 10XXXXXX (0x80 to 0xBF) can only be a second or subsequent byte in a UTF-8 encoding.
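The bit-pattern argument can be checked directly. This is a minimal sketch written for Python 3; the helper name `is_utf8_continuation` is made up for illustration and is not part of any library.

```python
def is_utf8_continuation(byte_value):
    """Return True if the byte has the bit pattern 10xxxxxx (0x80 to 0xBF)."""
    return (byte_value & 0xC0) == 0x80


# 0x96 written out in binary, padded to eight bits.
bits = format(0x96, "08b")
print(bits)
print(is_utf8_continuation(0x96))

# A continuation byte that does not follow a lead byte cannot be decoded.
try:
    b"a \x96 b".decode("utf-8")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)
```

An ASCII byte such as 0x41 or a lead byte such as 0xC3 does not match the pattern, which is why either of those may start a character and 0x96 may not.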

Hence the stream is either not UTF-8 or else is corrupted.

Examining the URI you link to, we find the header:


Content-Type: text/html

Since there is no encoding stated, we should use the default for HTTP, which is ISO-8859-1 (aka "Latin 1").

Examining the content we find the line:


<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

This is a fallback mechanism for people who are, for some reason, unable to set their HTTP headers correctly. This time we are explicitly told that the character encoding is ISO-8859-1.

As such, there's no reason to expect reading it as UTF-8 to work.
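Reading the declared charset out of a Content-Type value can be sketched as follows. This is an illustration only: `charset_from_content_type` is a hypothetical helper, it borrows the standard library's `email.message.Message` to parse the header parameters, and it assumes the ISO-8859-1 default described above when no charset is given.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html"))
print(charset_from_content_type("text/html; charset=UTF-8"))
```

The same function works on the `content` attribute of the `<meta http-equiv="Content-Type">` element, since it has the same syntax as the header.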

For extra fun, though, consider that in ISO-8859-1 the byte 0x96 encodes U+0096, which is the control character "START OF GUARDED AREA". A control character makes no sense in the middle of a web page, so ISO-8859-1 isn't correct either. It seems the people who created the page made an error similar to yours.

From context, it would seem that they actually used Windows-1252, as in that encoding 0x96 encodes U+2013 (EN DASH, which looks like –).

So, to parse this particular page you want to decode in Windows-1252.
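The difference between the two encodings is easy to see on a short byte string. The sample bytes below are invented for illustration and are not taken from the page; the sketch is written for Python 3, where `cp1252` is the codec name for Windows-1252.

```python
raw = b"2010 \x96 2011"  # made-up sample containing the problem byte

# Windows-1252 maps 0x96 to U+2013 (EN DASH).
text = raw.decode("cp1252")
print(text)

# ISO-8859-1 maps every byte to the code point of the same value,
# so 0x96 becomes the invisible control character U+0096.
latin1 = raw.decode("latin-1")
print(repr(latin1))
```

In the question's loop, this would mean replacing `.decode('utf8')` with `.decode('cp1252')` for this particular site.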

More generally, you want to examine the headers when picking a character encoding. That would perhaps give the wrong answer in this case (or perhaps not: more than a few "ISO-8859-1" codecs are actually Windows-1252), but you will be correct more often. You still need something to catch failures like this one by reading with a fallback. The decode method takes a second parameter called errors. The default is 'strict', but you can also use 'ignore', 'replace', 'xmlcharrefreplace' (not appropriate), or 'backslashreplace' (not appropriate), and you can register your own fallback handler with codecs.register_error().
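One way to put this together is sketched below for Python 3. The function `decode_page` and its order of attempts (declared charset, then UTF-8, then Windows-1252, then a lossy decode) are assumptions chosen for illustration, not a standard recipe.

```python
def decode_page(raw, declared_charset=None):
    """Decode bytes with the declared charset if possible, then fall back."""
    candidates = [declared_charset] if declared_charset else []
    candidates += ["utf-8", "cp1252"]
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers a charset name that Python does not know.
            continue
    # Last resort: never raise, mark undecodable bytes with U+FFFD instead.
    return raw.decode("utf-8", "replace")


sample = b"a \x96 b"
print(repr(sample.decode("utf-8", "replace")))  # bad byte becomes U+FFFD
print(repr(sample.decode("utf-8", "ignore")))   # bad byte is dropped
print(repr(decode_page(sample)))                # UTF-8 fails, cp1252 succeeds
```

Note that a declared charset of ISO-8859-1 never fails to decode, because every byte value is valid in it, so for a page like this one the result would still contain U+0096 rather than the intended dash. To skip a site instead of guessing, wrap the strict decode in `try`/`except UnicodeDecodeError` and `continue` to the next line of the list file.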

149

answered Sep 20 '22 03:09

Jon Hanna
