
Decomposing HTML to link text and target

Given an HTML link like

<a href="urltxt" class="someclass" close="true">texttxt</a>

how can I isolate the url and the text?

Updates

I'm using Beautiful Soup, and am unable to figure out how to do that.

I did

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))

links = soup.findAll('a')

for link in links:
    print "link content:", link.content," and attr:",link.attrs

I get

*link content: None  and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root    /support.asp')]*  ...
...

Why am I missing the content?

edit: elaborated on 'stuck' as advised :)

sundeep asked Nov 13 '08 00:11


4 Answers

Use Beautiful Soup. Doing it yourself is harder than it looks, so you'll be better off using a tried and tested module.

EDIT:

I think you want:

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())

By the way, it's a bad idea to open the URL inline like that: if the request fails, the error is raised in the middle of building the soup, where it is awkward to handle.
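One way to keep the fetch separate from the parsing is sketched below. It assumes Python 3's urllib.request rather than the Python 2 urllib used above, and fetch_html and its timeout parameter are hypothetical names chosen for illustration.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as exc:
        # URLError and HTTPError are OSError subclasses; urlopen raises
        # ValueError for a string that has no URL scheme at all.
        print("could not fetch %s: %s" % (url, exc))
        return None
```

The caller can then check for None before handing the bytes to the parser, so a failed request never reaches the parsing step.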

EDIT 2:

This should show you all the links in a page:

import urlparse, urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()

soup = BeautifulSoup(source)

# findAll returns every <a> tag in the document
for item in soup.findAll('a'):
    try:
        # Keep the href's case: URL paths can be case-sensitive
        link = urlparse.urlparse(item['href'])
    except KeyError:
        # This <a> tag has no href attribute
        pass
    else:
        print link

Harley Holcombe answered Nov 10 '22 23:11


Here's a code example that gets the attributes and contents of the links:

import urllib
import BeautifulSoup

# url is the address of the page to scan, as in the question
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.attrs, link.contents

Jerub answered Nov 11 '22 00:11


Looks like you have two issues there:

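  1. Use link.contents, not link.content. Beautiful Soup treats an unknown attribute name as a search for a child tag of that name, so link.content looks for a <content> tag inside the link and returns None.
  2. attrs is not a string. It holds the key/value pairs for each attribute of the HTML element. In BeautifulSoup 3, which you are using, it is a list of (name, value) tuples, as your output shows (in the later bs4 package it is a dictionary). link['href'] will get you what you appear to be looking for, but you'd want to guard that lookup in case you come across an a tag without an href attribute.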
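A minimal sketch of the href check, assuming Python 3 and the newer bs4 package rather than the BeautifulSoup 3 module from the question. In bs4, find_all replaces findAll and get returns None for a missing attribute. The function name link_pairs and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup: the link from the question plus an anchor with no href.
SAMPLE = (
    '<a href="urltxt" class="someclass" close="true">texttxt</a>'
    '<a name="top">no target here</a>'
)


def link_pairs(markup):
    """Return a (href, text) tuple for every <a> tag that has an href."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for link in soup.find_all("a"):
        href = link.get("href")  # None when the attribute is absent
        if href is None:
            continue  # skip anchors without a target
        # link.contents would give the list of child nodes instead;
        # get_text() flattens them into a single string.
        pairs.append((href, link.get_text()))
    return pairs
```

Calling link_pairs(SAMPLE) skips the second anchor because it has no href attribute.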

Tom answered Nov 11 '22 01:11


The others may well be right to point you to Beautiful Soup, but an external library might be massively over the top for your purposes. Here's a regex which will do what you ask, provided the href value is written in double quotes.

/<a\s+[^>]*?href="([^"]*)".*?>(.*?)<\/a>/

Here's what it matches:

'<a href="url" close="true">text</a>'
// Parts: "url", "text"

'<a href="url" close="true">text<span>something</span></a>'
// Parts: "url", "text<span>something</span>"
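The pattern above can be tried out in Python as a quick check. This is only a sketch: the surrounding slashes are dropped because Python's re module does not use them, and split_link is a hypothetical helper name.

```python
import re

# Same pattern as above, without the /.../ delimiters.
LINK_RE = re.compile(r'<a\s+[^>]*?href="([^"]*)".*?>(.*?)</a>')


def split_link(html):
    """Return (url, inner_html) for the first link found, or None."""
    match = LINK_RE.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For the two examples above, split_link returns the same parts that are listed there. A link whose href uses single quotes, or no quotes at all, yields None.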

If you wanted to get just the text (e.g. "textsomething" in the second example above), I'd just run another regex over the result to strip anything between angle brackets.
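That second pass could look something like the following in Python. The pattern and the name strip_tags are assumptions for illustration, and this simple approach will misfire on text that contains a literal "<" or ">".

```python
import re

# Anything from "<" up to the next ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(fragment):
    """Remove every <...> run from an HTML fragment."""
    return TAG_RE.sub("", fragment)
```

Applied to the second example's link text, strip_tags("text<span>something</span>") gives "textsomething".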

nickf answered Nov 11 '22 01:11