
BeautifulSoup KeyError Issue

I know that KeyErrors are fairly common with BeautifulSoup and, before you yell RTFM at me, I have done extensive reading in both the Python documentation and the BeautifulSoup documentation. Now that that's aside, I still haven't a clue what's going on with KeyErrors.

Here's the program I'm trying to run which constantly and consistently results in a KeyError on the last element of the URLs list.

I come from a C++ background, just to let you know, but I need to use BeautifulSoup for work; doing this in C++ would be an unimaginable nightmare!

The idea is to return a list of all pages on a website that contain a link to a certain URL.

Here's what I got so far:

import urllib
from BeautifulSoup import BeautifulSoup

URLs = []
Locations = []
URLs.append("http://www.tuftsalumni.org")

def print_links (link):
    if (link.startswith('/') or link.startswith('http://www.tuftsalumni')):
        if (link.startswith('/')):
            link = "STARTING_WEBSITE" + link
        print (link)
        htmlSource = urllib.urlopen(link).read(200000)
        soup = BeautifulSoup(htmlSource)
        for item in soup.fetch('a'):
            if (item['href'].startswith('/') or 
                "tuftsalumni" in item['href']):
                URLs.append(item['href'])
            length = len(URLs)
            if (item['href'] == "SITE_ON_PAGE"):
                if (check_list(link, Locations) == "no"):
                    Locations.append(link)



def check_list (link, array):
    for x in range (0, len(array)):
        if (link == array[x]):
            return "yes"
    return "no"

print_links(URLs[0])

for x in range (0, (len(URLs))):
    print_links(URLs[x]) 

The error I get is on the next to last element of URLs:

File "scraper.py", line 35, in <module>
    print_links(URLs[x])
  File "scraper.py", line 16, in print_links
    if (item['href'].startswith('/') or 
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/BeautifulSoup.py", line 613, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'   

Now I know I need to use get() to handle the KeyError default case. I have absolutely no idea how to actually do that, despite literally an hour of searching.

Thank you. If I can clarify anything, please do let me know.

James Roseman asked Mar 08 '12 00:03




1 Answer

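item['href'] raises KeyError for any <a> tag that has no href attribute (a named anchor such as <a name="top">, for example). If you just want to handle the error, you can catch the exception: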

    for item in soup.fetch('a'):
        try:
            if (item['href'].startswith('/') or "tuftsalumni" in item['href']):
                pass # (...) rest of the original loop body
        except KeyError:
            pass # anchor has no href; skip it or take some other fallback action

You can specify a default using item.get('key','default'), but I don't think that's what you need in this case.
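For reference, the get() route could look roughly like the sketch below. To keep it runnable without extra packages it uses Python 3's standard html.parser in place of BeautifulSoup, so treat it as an illustration of the lookup pattern rather than of the BeautifulSoup API; on a BeautifulSoup tag the equivalent call would be item.get('href', ''). The class name AnchorCollector, the function internal_hrefs, and the sample HTML are all made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical sample page: the second anchor has no href attribute.
SAMPLE_HTML = (
    '<a href="/about">About</a>'
    '<a name="top">Top</a>'
    '<a href="http://www.tuftsalumni.org/events">Events</a>'
)


class AnchorCollector(HTMLParser):
    """Collects the attributes of every <a> tag as a plain dict."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))


def internal_hrefs(html):
    """Return hrefs that are site-relative or mention 'tuftsalumni'."""
    parser = AnchorCollector()
    parser.feed(html)
    hrefs = []
    for attrs in parser.anchors:
        # get() returns the default ("") instead of raising KeyError
        # when the anchor has no href attribute.
        href = attrs.get("href", "")
        if href.startswith("/") or "tuftsalumni" in href:
            hrefs.append(href)
    return hrefs


print(internal_hrefs(SAMPLE_HTML))
```

Using an empty string as the default keeps the later startswith() and "in" checks valid, so the anchor without an href is simply skipped.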

Edit: If all else fails, this is a barebones version that should be a reasonable starting point:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib
from BeautifulSoup import BeautifulSoup

links = ["http://www.tuftsalumni.org"]

def print_hrefs(link):
    htmlSource = urllib.urlopen(link).read()
    soup = BeautifulSoup(htmlSource)
    for item in soup.fetch('a'):
        href = item.get('href')  # None when the <a> tag has no href attribute
        if href is not None:
            print href

for link in links:
    print_hrefs(link)

Also, check_list(item, l) can be replaced by the built-in membership test item in l, which evaluates to True or False instead of "yes" or "no".
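As a small sketch of that replacement, assuming the goal is only to avoid duplicate entries in Locations: the helper name record_location and the example URL path are hypothetical, not part of the original program.

```python
locations = []


def record_location(link, locations):
    """Append link only if it is not already in the list."""
    # "link not in locations" stands in for check_list(link, locations) == "no"
    if link not in locations:
        locations.append(link)


# Hypothetical page URL, recorded twice to show that duplicates are skipped.
record_location("http://www.tuftsalumni.org/page-a", locations)
record_location("http://www.tuftsalumni.org/page-a", locations)
print(locations)
```

If the list grows large, a set would make the membership test faster, at the cost of losing insertion order.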

Eduardo Ivanec answered Oct 27 '22 03:10