Python module BeautifulSoup extracting anchors href

Question

I am using the BeautifulSoup module to collect every href in an HTML page, like this:

def extract_links(html):
  soup = BeautifulSoup(html)
  anchors = soup.findAll('a')
  print anchors
  links = []
  for a in anchors:
    links.append(a['href'])
  return links

But sometimes it fails with this error:

Traceback (most recent call last):
  File "C:\py\main.py", line 33, in <module>
    urls = extract_links(page)
  File "C:\py\main.py", line 11, in extract_links
    links.append(a['href'])
  File "C:\py\BeautifulSoup.py", line 601, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'

tjarratt · Accepted Answer

Not all anchor tags will have an href attribute. You should check that the anchor has an href before you try to access that attribute.

if a.has_key('href'):
  links.append(a['href'])

After checking some comments here, I think this is the most pythonic way of handling this case.
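
For readers on the newer bs4 package (an assumption; the traceback in the question comes from the older BeautifulSoup 3 module), the same check is usually spelled with Tag.has_attr rather than has_key. A minimal sketch, using a made-up HTML snippet in which one anchor has no href:

```python
from bs4 import BeautifulSoup  # assumes the bs4 package is installed

# Sample markup (made up for illustration): the second anchor has no href.
SAMPLE_HTML = '<a href="/one">one</a><a name="top">top</a><a href="/two">two</a>'


def extract_links(html):
    """Return the href of every anchor that actually has one."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a"):
        # has_attr avoids the KeyError that a["href"] raises on <a name="top">
        if a.has_attr("href"):
            links.append(a["href"])
    return links


links = extract_links(SAMPLE_HTML)
```

In bs4, a.get("href") is another option: it returns None for a missing attribute instead of raising.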

Matt Luongo · Answer

Try this.

links = [a['href'] for a in anchors if a.has_key('href')]

Or, if you'd rather mutate an existing list:

links = []
#...
links.extend(a['href'] for a in anchors if a.has_key('href'))
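
If bs4 is available (an assumption, since the question uses the older BeautifulSoup 3 module), the filter can also be pushed into the search itself: passing href=True to find_all matches only anchors that carry the attribute. A short sketch with illustrative markup:

```python
from bs4 import BeautifulSoup  # assumes the bs4 package is installed

# Illustrative markup: only two of the three anchors have an href.
html = (
    '<p><a href="http://example.com/">home</a> '
    '<a id="here">anchor</a> '
    '<a href="#top">top</a></p>'
)

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that have an href attribute at all
links = [a["href"] for a in soup.find_all("a", href=True)]
```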

Aditya · Answer

soup.findAll() returns a list of Tag objects, each of which carries its own set of attributes. An anchor without an href simply has no such entry, so you need to check a tag's attributes before reading one.

Taking your example and modifying it, this code works:

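from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the traceback

def extract_links(html):
  soup = BeautifulSoup(html)
  anchors = soup.findAll('a')
  print anchors
  links = []
  for a in anchors:
    # dict() copes with attrs being a list of (name, value) pairs or a dict
    if 'href' in dict(a.attrs):
      links.append(a['href'])
  return links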
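
The same idea, that each tag carries its own attributes and href may simply be absent, can be shown without BeautifulSoup at all. The sketch below swaps in the standard library's html.parser (Python 3) purely for illustration; it is not what the answers above use, and the class name LinkCollector is hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, skipping anchors without one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, not a dict
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is not None:
            self.links.append(href)


def extract_links(html):
    """Parse html and return the hrefs found on its anchors."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample_links = extract_links('<a href="/a">A</a><a name="b">B</a><a href="/c">C</a>')
```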

Tags: python, html, beautifulsoup

Michal
