
Beautiful Soup and UnicodeDecodeError

I am trying to crawl a page, but I get a UnicodeDecodeError. Here is my code:

import urllib2
from bs4 import BeautifulSoup  # assuming Beautiful Soup 4

def soup_def(link):
    req = urllib2.Request(link, headers={'User-Agent' : "Magic Browser"})
    usock = urllib2.urlopen(req)
    encoding = usock.headers.getparam('charset')
    page = usock.read().decode(encoding)
    usock.close()
    soup = BeautifulSoup(page)
    return soup

soup = soup_def("http://www.geekbuying.com/item/Ainol-Novo-10-Hero-II-Quad-Core--Tablet-PC-10-1-inch-IPS-1280-800-1GB-RAM-16GB-ROM-Android-4-1--HDMI-313618.html")

And the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 284: invalid start byte

I have seen that a few other users hit the same error, but I cannot figure out a solution.
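
For reference, the message says that the byte at position 284 of the response is 0xff, which can never begin a UTF-8 character, so the bytes are not valid UTF-8 at that point, whatever the charset header says. The sketch below reproduces the same kind of error on made-up bytes and shows one tolerant way to decode. It is written for Python 3 (the code above uses Python 2's urllib2), and `decode_lenient` and `sample` are hypothetical names used only for illustration, not part of any library.

```python
def decode_lenient(raw, declared=None):
    """Decode bytes with the declared charset, else UTF-8 with replacement."""
    if declared:
        try:
            return raw.decode(declared)
        except (UnicodeDecodeError, LookupError):
            pass  # wrong or unknown charset: fall through to the tolerant path
    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError
    return raw.decode("utf-8", errors="replace")


# Made-up bytes: valid UTF-8 text followed by a stray 0xff
sample = b"<p>caf\xc3\xa9</p>\xff"

try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason, "at position", exc.start)

print(decode_lenient(sample, "utf-8"))
```

Replacing bad bytes hides the problem rather than explaining it, so this is a fallback for when the real encoding cannot be determined. The `declared=None` default also covers a response whose headers carry no charset at all.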

Tasos asked Nov 13 '13 14:11


1 Answer

Another possibility is that you are trying to parse a hidden file, which is very common on Macs (for example, the .DS_Store files that macOS creates in folders).

Add a simple if statement so that you only create BeautifulSoup objects from files that are actually HTML:

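import os
from bs4 import BeautifulSoup

for root, dirs, files in os.walk(folderPath, topdown=True):
    for fileName in files:
        # Only parse real HTML files; this skips hidden files and other non-HTML data
        if fileName.endswith(".html"):
            soup = BeautifulSoup(open(os.path.join(root, fileName)).read(), 'lxml')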
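
The filter above can also be pulled into a small generator that skips hidden directories as well as hidden files. This is only a sketch under stated assumptions: it targets Python 3, uses the standard library only, and `iter_html_paths` is a hypothetical helper name, not something Beautiful Soup provides.

```python
import os


def iter_html_paths(folder):
    """Yield paths of visible .html files under folder, skipping hidden entries."""
    for root, dirs, files in os.walk(folder, topdown=True):
        # With topdown=True, pruning dirs in place stops os.walk from
        # descending into hidden directories such as .git
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for name in files:
            # Skip dotfiles (e.g. .DS_Store) and anything that is not .html
            if name.startswith(".") or not name.lower().endswith(".html"):
                continue
            yield os.path.join(root, name)
```

Each yielded path could then be opened in binary mode ("rb") and the bytes passed to BeautifulSoup, which accepts bytes and attempts to detect the encoding itself.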

drdrb answered Oct 20 '22 23:10