UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

I'm trying to write a scraper, but I'm having issues with encoding. When I tried to copy the string I was looking for into my text file, Python 2.7 told me it didn't recognize the encoding, even though the string has no special characters. I don't know if that's useful info.

My code looks like this:

from urllib import FancyURLopener
import os

class MyOpener(FancyURLopener):  # spoofs a real browser on Windows
   version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

print "What is the webaddress?"
webaddress = raw_input("8::>")

print "Folder Name?"
foldername = raw_input("8::>")

if not os.path.exists(foldername):
    os.makedirs(foldername)

def urlpuller(start, page):
   while page[start]!= '"':
      start += 1
   close = start
   while page[close]!='"':
      close += 1
   return page[start:close]

myopener = MyOpener()

response = myopener.open(webaddress)
site = response.read()

nexturl = ''
counter = 0

while(nexturl!=webaddress):
   counter += 1
   start = 0
   
   for i in range(len(site)-35):
       if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
         start = i + 40
         break
   else:
      print "Something's broken, chief. Error = 1"
   
   next = 0
   
   for i in range(start, 8, -1):
      if site[i:i+8] == u'<a href=':
         next = i
         break
   else:
      print "Something's broken, chief. Error = 2"
   
   nexturl = urlpuller(next, site)
   
   myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')

print("Retrieval of "+foldername+" completed.")

When I run it against the site I'm scraping, it raises this error:

Traceback (most recent call last):
  File "yada/yadayada/Python/scraper.py", line 37, in <module>
    if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

When pointed at http://google.com, it worked just fine. The page I'm scraping declares its encoding as UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

but when I try to decode using UTF-8, as you can see, it does not work.

Any suggestions?

asked Jun 02 '14 22:06

user3701032

2 Answers

site[i:i+35].decode('utf-8')

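You cannot slice the bytes you've received at arbitrary offsets and then ask Python to decode each slice as UTF-8. UTF-8 is a multibyte encoding: one character takes anywhere from 1 to 4 bytes (the original design allowed up to 6). In your traceback, 0xc3 is the lead byte of a two-byte sequence, and it sits at position 34, the last byte of the 35-byte slice, so its continuation byte was cut off. When a slice ends in the middle of a character like that, Python raises the "unexpected end of data" error.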
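To see the failure in isolation, here is a minimal sketch. The page bytes, the `slice_decodes` helper, and the variable names are made up for illustration; only the marker string comes from the question. It shows a 35-byte slice that ends on a lone 0xC3 lead byte, and one possible fix: decode the complete response once, then search the resulting text.

```python
# Made-up page bytes: 34 ASCII bytes, then u'\xe9' (bytes 0xC3 0xA9 in UTF-8),
# then the tag the question is looking for.
page = (b'x' * 34
        + u'\xe9'.encode('utf-8')
        + b'<img id="imgSized" class="slideImg" src="a.jpg">')
marker = u'<img id="imgSized" class="slideImg"'


def slice_decodes(data, start, length):
    """Return True if data[start:start+length] is valid UTF-8 on its own.

    Hypothetical helper, used only to demonstrate the error.
    """
    try:
        data[start:start + length].decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# The 35-byte slice ends on the 0xC3 lead byte, so decoding it fails.
cut_in_half = slice_decodes(page, 0, 35)
# One byte more includes the continuation byte, so decoding succeeds.
cut_cleanly = slice_decodes(page, 0, 36)

# Fix: decode the whole response once, then search the decoded text.
text = page.decode('utf-8')
marker_index = text.find(marker)
```

Since the marker is pure ASCII, searching the raw bytes with `page.find(b'<img id="imgSized"')` and never decoding would also avoid the error.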

Rather than scanning the bytes by hand, use a parser that handles this for you. BeautifulSoup and lxml are two options.
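As a rough illustration of the parser approach, the sketch below uses the standard library's `HTMLParser` as a stand-in for the third-party libraries named above, so it runs without installing anything. The `SlideImageFinder` class, the sample markup, and the assumption that the image sits inside the "next" link are hypothetical; the real page may be laid out differently.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7


class SlideImageFinder(HTMLParser):
    """Hypothetical helper: records the src of <img id="imgSized">
    and the href of the most recent <a> tag seen before it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.last_href = None
        self.image_url = None
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self.last_href = attrs['href']
        elif tag == 'img' and attrs.get('id') == 'imgSized':
            self.image_url = attrs.get('src')
            self.next_url = self.last_href


# Made-up sample bytes standing in for response.read().
raw = (u'<p>caf\xe9</p><a href="/page/2">'
       u'<img id="imgSized" class="slideImg" src="/pics/1.jpg"></a>'
       ).encode('utf-8')

finder = SlideImageFinder()
finder.feed(raw.decode('utf-8'))  # decode the whole page once, then parse
finder.close()

image_url = finder.image_url
next_url = finder.next_url
```

The parser works on attributes instead of fixed byte offsets, so it does not depend on the tag being exactly 35 bytes long or on where a multibyte character happens to fall.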

answered Oct 19 '22 06:10

14 revs, 12 users 16%

Open the CSV file in Sublime Text and choose "Save with Encoding" -> UTF-8.

answered Oct 19 '22 06:10

ssareen

Tags: python, character-encoding, utf-8, decoding
