I am trying to crawl a page, but I get a UnicodeDecodeError. Here is my code:
import urllib2
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4

def soup_def(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    usock = urllib2.urlopen(req)
    encoding = usock.headers.getparam('charset')
    page = usock.read().decode(encoding)
    usock.close()
    soup = BeautifulSoup(page)
    return soup

soup = soup_def("http://www.geekbuying.com/item/Ainol-Novo-10-Hero-II-Quad-Core--Tablet-PC-10-1-inch-IPS-1280-800-1GB-RAM-16GB-ROM-Android-4-1--HDMI-313618.html")
And the error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 284: invalid start byte
I have seen that a few other users hit the same error, but I cannot figure out a solution.
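One thing that may help: the byte 0xff can never appear in valid UTF-8, so the page is either not really UTF-8 or it contains stray bytes. Below is a minimal sketch, not a definitive fix, that tries the declared charset and falls back to replacing undecodable bytes instead of raising. It is written in Python 3 syntax, and `decode_page` is a hypothetical helper name, not part of urllib2 or BeautifulSoup.

```python
def decode_page(raw, declared_encoding=None, fallback="utf-8"):
    """Decode raw bytes, replacing undecodable bytes instead of raising."""
    # The server may not declare a charset at all, so allow None here.
    encoding = declared_encoding or fallback
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers a charset name that Python does not know.
        # errors="replace" swaps each bad byte for U+FFFD.
        return raw.decode(fallback, errors="replace")


sample = b"price: 10\xff USD"  # 0xff is never valid in UTF-8
text = decode_page(sample, "utf-8")
```

In the code from the question, this would be used as `page = decode_page(usock.read(), encoding)`. Alternatively, BeautifulSoup accepts the raw bytes directly and tries to detect the encoding itself, so the manual decode step can often be dropped.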
Another possibility is that you are trying to parse a hidden file (very common on Macs, where Finder creates files such as .DS_Store).
Add a simple if statement so that you only create BeautifulSoup objects from files that are actually HTML:
import os
from bs4 import BeautifulSoup

for root, dirs, files in os.walk(folderPath, topdown=True):
    for fileName in files:
        if fileName.endswith(".html"):
            soup = BeautifulSoup(open(os.path.join(root, fileName)).read(), 'lxml')
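The extension check above still lets through a hidden file whose name happens to end in ".html". A slightly stricter variant, sketched under the assumption that hidden files and directories are the ones whose names start with a dot, could look like this (`iter_html_files` is a hypothetical helper name):

```python
import os


def iter_html_files(folder_path):
    """Yield paths of non-hidden .html files under folder_path."""
    for root, dirs, files in os.walk(folder_path, topdown=True):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for file_name in files:
            if file_name.startswith("."):
                continue  # skip hidden files such as .DS_Store
            if file_name.lower().endswith(".html"):
                yield os.path.join(root, file_name)
```

Each yielded path can then be opened and passed to BeautifulSoup as in the loop above.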