I am trying to check whether a certain word appears on a page, for many sites. The script runs fine for, say, 15 sites and then stops with this error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15344: invalid start byte
I searched Stack Overflow and found many questions about this error, but I can't work out what went wrong in my case.
I would like to either fix it or, if a site raises an error, skip that site. Please advise how I can do this; I am new to this, and the code below took me a day to write. By the way, the site the script halted on was http://www.homestead.com
    filetocheck = open("bloglistforcommenting","r")
    resultfile = open("finalfile","w")

    for countofsites in filetocheck.readlines():
        sitename = countofsites.strip()
        htmlfile = urllib.urlopen(sitename)
        page = htmlfile.read().decode('utf8')
        match = re.search("Enter your name", page)
        if match:
            print "match found : " + sitename
            resultfile.write(sitename+"\n")
        else:
            print "sorry did not find the pattern " +sitename

    print "Finished Operations"
As per Mark's comments, I changed the code to use BeautifulSoup:

    htmlfile = urllib.urlopen("http://www.homestead.com")
    page = BeautifulSoup((''.join(htmlfile)))
    print page.prettify()

Now I am getting this error:

    page = BeautifulSoup((''.join(htmlfile)))
    TypeError: 'module' object is not callable
I am trying their quick start example from http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick%20Start. If I copy paste it then the code works fine.
I FINALLY got it to work. Thank you all for your help. Here is the final code.
    import urllib
    import re
    from BeautifulSoup import BeautifulSoup

    filetocheck = open("listfile","r")
    resultfile = open("finalfile","w")
    error ="for errors"

    for countofsites in filetocheck.readlines():
        sitename = countofsites.strip()
        htmlfile = urllib.urlopen(sitename)
        page = BeautifulSoup((''.join(htmlfile)))
        pagetwo =str(page)
        match = re.search("Enter YourName", pagetwo)
        if match:
            print "match found : " + sitename
            resultfile.write(sitename+"\n")
        else:
            print "sorry did not find the pattern " +sitename

    print "Finished Operations"
The Python "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte" occurs when we specify an incorrect encoding when decoding a bytes object. To solve the error, specify the correct encoding, e.g. utf-16, or open the file in binary mode (rb or wb).
The byte at 15344 is 0x96. Presumably at position 15343 there is either a single-byte encoding of a character, or the last byte of a multiple-byte encoding, making 15344 the start of a character. 0x96 is in binary 10010110, and any byte matching the pattern 10XXXXXX (0x80 to 0xBF) can only be a second or subsequent byte in a UTF-8 encoding.
Hence the stream is either not UTF-8 or else is corrupted.
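The bit-pattern argument above can be checked directly. This is a minimal sketch, written for Python 3 rather than the Python 2 used in the question; the helper name `can_decode_utf8` and the sample byte strings are made up for illustration.

```python
# Byte reported in the traceback from the question.
bad_byte = 0x96

# UTF-8 continuation bytes have the bit pattern 10xxxxxx,
# so masking the top two bits must give 10000000.
is_continuation = (bad_byte & 0b11000000) == 0b10000000


def can_decode_utf8(data):
    """Return True if data is valid UTF-8, False otherwise (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(format(bad_byte, "08b"), is_continuation)

# A lone 0x96 after an ASCII character cannot start a UTF-8 sequence.
print(can_decode_utf8(b"a \x96 b"))

# A real en dash encoded as UTF-8 is three bytes and decodes fine.
print(can_decode_utf8("a \u2013 b".encode("utf-8")))
```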
Examining the URI you link to, we find the header:
Content-Type: text/html
Since there is no encoding stated, we should use the default for HTTP, which is ISO-8859-1 (aka "Latin 1").
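One way to apply this rule in code is to read the charset parameter from the Content-Type value and fall back to ISO-8859-1 when it is absent. The sketch below assumes Python 3 and uses the standard library's `email.message.Message` to parse the header value; the function name `charset_from_content_type` is hypothetical, and the header strings are examples rather than captured responses.

```python
from email.message import Message

# Default assumed here for text/* over HTTP when no charset is given.
HTTP_DEFAULT_CHARSET = "iso-8859-1"


def charset_from_content_type(content_type, default=HTTP_DEFAULT_CHARSET):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html"))
print(charset_from_content_type("text/html; charset=UTF-8"))
```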
Examining the content we find the line:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
This is a fallback mechanism for people who are, for some reason, unable to set their HTTP headers correctly. This time we are explicitly told the character encoding is ISO-8859-1.
As such, there's no reason to expect reading it as UTF-8 to work.
For extra fun though, when we consider that in ISO-8859-1 0x96 encodes U+0096 which is the control character "START OF GUARDED AREA" we find that ISO-8859-1 isn't correct either. It seems the people creating the page made a similar error to yourself.
From context, it would seem that they actually used Windows-1252, as in that encoding 0x96 encodes U+2013 (EN DASH, which looks like –).
So, to parse this particular page you want to decode in Windows-1252.
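The difference between the two encodings is easy to see on a small sample. This is a sketch in Python 3; the byte string `raw` is invented for illustration and is not taken from the page in question.

```python
# Invented sample: ASCII text with a single 0x96 byte in the middle.
raw = b"2001 \x96 2011"

# ISO-8859-1 maps every byte to the code point of the same value,
# so 0x96 becomes the C1 control character U+0096.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 (codec name "cp1252" in Python) maps 0x96 to U+2013, EN DASH.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```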
More generally, you want to examine headers when picking character encodings, and while it would perhaps be incorrect in this case (or perhaps not; more than a few "ISO-8859-1" codecs are actually Windows-1252), you'll be correct more often. You still need to have something catch failures like this by reading with a fallback. The decode method takes a second parameter called errors. The default is 'strict', but you can also have 'ignore', 'replace', 'xmlcharrefreplace' (not appropriate), 'backslashreplace' (not appropriate) and you can register your own fallback handler with codecs.register_error().
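A fallback along these lines might look like the sketch below. It assumes Python 3; the function name `decode_with_fallback` and the choice of candidate encodings are illustrative, not something prescribed above. It tries each encoding strictly and only then decodes with errors='replace', so a bad byte never stops the loop over sites.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each encoding strictly, then fall back to a lossy UTF-8 decode."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: undecodable bytes become U+FFFD instead of raising.
    return data.decode("utf-8", errors="replace")


sample = b"a \x96 b"

# The errors parameter on its own, without trying other encodings:
ignored = sample.decode("utf-8", errors="ignore")    # bad byte dropped
replaced = sample.decode("utf-8", errors="replace")  # bad byte -> U+FFFD

print(repr(ignored))
print(repr(replaced))
print(repr(decode_with_fallback(sample)))
```

In the question's loop, the result of `htmlfile.read()` could be passed through such a function before the `re.search` call, so that one badly encoded site does not halt the whole run.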