I know that KeyErrors are fairly common with BeautifulSoup and, before you yell RTFM at me, I have done extensive reading in both the Python documentation and the BeautifulSoup documentation. That aside, I still haven't a clue what's going on with KeyErrors.
Here's the program I'm trying to run which constantly and consistently results in a KeyError on the last element of the URLs list.
I come from a C++ background, just to let you know, but I need to use BeautifulSoup for work. Doing this in C++ would be an unimaginable nightmare!
The idea is to return a list of all URLs in a website that contain on their pages links to a certain URL.
Here's what I've got so far:
import urllib
from BeautifulSoup import BeautifulSoup
URLs = []
Locations = []
URLs.append("http://www.tuftsalumni.org")
def print_links (link):
    if (link.startswith('/') or link.startswith('http://www.tuftsalumni')):
        if (link.startswith('/')):
            link = "STARTING_WEBSITE" + link
        print (link)
        htmlSource = urllib.urlopen(link).read(200000)
        soup = BeautifulSoup(htmlSource)
        for item in soup.fetch('a'):
            if (item['href'].startswith('/') or 
                "tuftsalumni" in item['href']):
                URLs.append(item['href'])
            length = len(URLs)
            if (item['href'] == "SITE_ON_PAGE"):
                if (check_list(link, Locations) == "no"):
                    Locations.append(link)
def check_list (link, array):
    for x in range (0, len(array)):
        if (link == array[x]):
            return "yes"
    return "no"
print_links(URLs[0])
for x in range (0, (len(URLs))):
    print_links(URLs[x]) 
The error I get is on the next to last element of URLs:
  File "scraper.py", line 35, in <module>
    print_links(URLs[x])
  File "scraper.py", line 16, in print_links
    if (item['href'].startswith('/') or 
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/BeautifulSoup.py", line 613, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
Now I know I need to use get() to handle the KeyError default case. I have absolutely no idea how to actually do that, despite literally an hour of searching.
Thank you. If I can clarify this at all, please do let me know.
The most common problem (with many modern pages): the page uses JavaScript to add elements, but requests / BeautifulSoup can't run JavaScript. You may need to use Selenium to control a real web browser, which can run JavaScript. I use XPath, but you may also use a CSS selector.
BeautifulSoup itself is written in pure Python and can be slow. The usual advice is to install lxml and use it alongside BeautifulSoup as the parser. lxml is a C-based parser that should be much faster.
If you just want to handle the error, you can catch the exception:
    for item in soup.fetch('a'):
        try:
            if (item['href'].startswith('/') or "tuftsalumni" in item['href']):
                URLs.append(item['href'])
            # (...) rest of the loop body as before
        except KeyError:
            pass # this <a> tag has no href; or some other fallback action
You can specify a default using item.get('key','default'), but I don't think that's what you need in this case.
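For reference, `get` on a BeautifulSoup tag follows the same convention as `dict.get`: it returns a default (`None` unless you pass one) instead of raising `KeyError`, so something like `item.get('href', '')` in the loop above would be one way to skip anchors without an `href`. Below is a minimal sketch of that pattern, not the asker's actual scraper. It is Python 3 and swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra packages; `HrefCollector` and `sample_html` are made-up names for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect matching href values, tolerating <a> tags without an href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs. As a dict it supports
        # .get(), which returns None for a missing key instead of raising
        # KeyError. "or ''" also covers a bare <a href> with no value.
        href = dict(attrs).get("href") or ""
        if href.startswith("/") or "tuftsalumni" in href:
            self.hrefs.append(href)


# Hypothetical sample markup; the second anchor has no href attribute.
sample_html = (
    '<a href="/about">About</a>'
    '<a name="top">Top</a>'
    '<a href="http://example.com">Elsewhere</a>'
)

collector = HrefCollector()
collector.feed(sample_html)
print(collector.hrefs)
```

The anchor without an `href` is simply skipped rather than stopping the crawl, which is the behaviour the try/except version gives as well.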
Edit: If everything else fails, this is a barebones version that should be a reasonable starting point:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
from BeautifulSoup import BeautifulSoup
links = ["http://www.tuftsalumni.org"]
def print_hrefs(link):
    htmlSource = urllib.urlopen(link).read()
    soup = BeautifulSoup(htmlSource)
    for item in soup.fetch('a'):
        try:
            print item['href']
        except KeyError:
            pass # this <a> tag has no href attribute
for link in links:
    print_hrefs(link)
Also, check_list(item, l) can be replaced by item in l.
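To illustrate that last point, here is a small sketch (Python 3) that keeps the question's `check_list` only for comparison and uses the `in` / `not in` operators for the actual check. The `locations` and `found` lists hold made-up URLs, not real crawl results.

```python
def check_list(link, array):
    """Original helper from the question, kept here only for comparison."""
    for x in range(0, len(array)):
        if link == array[x]:
            return "yes"
    return "no"


# Hypothetical data standing in for the Locations list and newly found pages.
locations = ["http://www.tuftsalumni.org/news"]
found = [
    "http://www.tuftsalumni.org/news",    # already recorded
    "http://www.tuftsalumni.org/events",  # new
    "http://www.tuftsalumni.org/events",  # duplicate of the new one
]

for link in found:
    # "link in locations" gives the same answer as check_list(...) == "yes".
    assert (link in locations) == (check_list(link, locations) == "yes")
    if link not in locations:
        locations.append(link)

print(locations)
```

If the list grows large, a `set` would make the membership test faster, at the cost of losing insertion order.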