I know that KeyErrors are fairly common with BeautifulSoup and, before you yell RTFM at me, I have done extensive reading in both the Python documentation and the BeautifulSoup documentation. Now that that's out of the way, I still haven't a clue what's going on with KeyErrors.
Here's the program I'm trying to run which constantly and consistently results in a KeyError on the last element of the URLs list.
I come from a C++ background, just to let you know, but I need to use BeautifulSoup for work; doing this in C++ would be an unimaginable nightmare!
The idea is to return a list of all URLs in a website that contain on their pages links to a certain URL.
Here's what I got so far:
import urllib
from BeautifulSoup import BeautifulSoup

URLs = []
Locations = []

URLs.append("http://www.tuftsalumni.org")

def print_links (link):
    if (link.startswith('/') or link.startswith('http://www.tuftsalumni')):
        if (link.startswith('/')):
            link = "STARTING_WEBSITE" + link
        print (link)
        htmlSource = urllib.urlopen(link).read(200000)
        soup = BeautifulSoup(htmlSource)
        for item in soup.fetch('a'):
            if (item['href'].startswith('/') or
                    "tuftsalumni" in item['href']):
                URLs.append(item['href'])
                length = len(URLs)
                if (item['href'] == "SITE_ON_PAGE"):
                    if (check_list(link, Locations) == "no"):
                        Locations.append(link)

def check_list (link, array):
    for x in range (0, len(array)):
        if (link == array[x]):
            return "yes"
    return "no"

print_links(URLs[0])
for x in range (0, (len(URLs))):
    print_links(URLs[x])
The error I get is on the next to last element of URLs:
File "scraper.py", line 35, in <module>
print_links(URLs[x])
File "scraper.py", line 16, in print_links
if (item['href'].startswith('/') or
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site- packages/BeautifulSoup.py", line 613, in __getitem__
return self._getAttrMap()[key]
KeyError: 'href'
Now I know I need to use get() to handle the KeyError default case. I have absolutely no idea how to actually do that, despite literally an hour of searching.
Thank you, if I can clarify this at all please do let me know.
The most common problem (with many modern pages): the page uses JavaScript to add elements, but requests/BeautifulSoup can't run JavaScript. You may need to use Selenium to control a real web browser, which can run JavaScript. I use XPath, but you may also use CSS selectors.
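For example, a minimal Selenium sketch (my illustration, not from the original answer; it assumes a Firefox driver is installed and reuses the URL from the question):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Drive a real browser, so links added by JavaScript are also rendered
driver = webdriver.Firefox()
driver.get("http://www.tuftsalumni.org")

# XPath for every <a> element that actually carries an href attribute
for element in driver.find_elements(By.XPATH, "//a[@href]"):
    print(element.get_attribute("href"))

driver.quit()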
BeautifulSoup's built-in parser is written in pure Python and is slow. The internet is unanimous: one should install and use lxml alongside BeautifulSoup. lxml is a C-based parser that should be much, much faster.
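Here is a sketch of selecting the lxml parser (an assumption on my part: this uses the newer bs4 package, whereas the question imports the old BeautifulSoup 3):

# Requires: pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup

html = "<html><body><a href='/alumni'>Alumni</a><a name='top'></a></body></html>"
soup = BeautifulSoup(html, "lxml")  # name the C-based lxml parser explicitly
for a in soup.find_all("a"):
    print(a.get("href"))  # prints /alumni, then None for the bare anchor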
Not every <a> tag has an href attribute (anchors like <a name="top"> don't), so item['href'] raises a KeyError for those. If you just want to handle the error, you can catch the exception:
for item in soup.fetch('a'):
    try:
        if (item['href'].startswith('/') or "tuftsalumni" in item['href']):
            (...)
    except KeyError:
        pass  # or some other fallback action
You can specify a default using item.get('key', 'default'), but I don't think that's what you need in this case.
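In case it helps, here is a sketch of that get() approach applied to the loop from the question (soup and URLs as defined there; skipping tags that lack an href entirely is my assumption about the desired behavior):

for item in soup.fetch('a'):
    href = item.get('href')  # returns None instead of raising KeyError
    if href is None:
        continue  # e.g. <a name="top"> has no href; skip it
    if href.startswith('/') or "tuftsalumni" in href:
        URLs.append(href)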
Edit: If everything else fails, this is a barebones version that should be a reasonable starting point:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib
from BeautifulSoup import BeautifulSoup

links = ["http://www.tuftsalumni.org"]

def print_hrefs(link):
    htmlSource = urllib.urlopen(link).read()
    soup = BeautifulSoup(htmlSource)
    for item in soup.fetch('a'):
        print item['href']

for link in links:
    print_hrefs(link)
Also, check_list(item, l) can be replaced by item in l.
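For example (a minimal sketch using names from the question):

Locations = []
link = "http://www.tuftsalumni.org"
if link not in Locations:  # equivalent to check_list(link, Locations) == "no"
    Locations.append(link)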