How can I get href links from HTML using Python?

import urllib2
website = "WEBSITE"
# Open the URL and read the raw HTML from the response object
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()
print html

So far so good.

But I want only the href links from the HTML text. How can I do that?

user371012 asked Jun 19 '10 12:06


People also ask

How do you find the href in Python?

To get href values with Beautiful Soup in Python, create a soup object by calling the BeautifulSoup class with the HTML string. Then call find_all with 'a' and href set to True, which returns only the a elements that have an href attribute.
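A minimal sketch of that approach, assuming Python 3 with the bs4 package installed. The sample HTML and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative HTML: two anchors with href, one without
html = """
<a href="https://example.com/a">A</a>
<a name="anchor-only">no href here</a>
<a href="/relative/path">B</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> elements that actually carry an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

Because of href=True, indexing with a["href"] is safe here; without the filter, link.get('href') would return None for the anchor that has no href.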

How do you scrape href in Python?

Steps to follow: write a function that fetches the HTML document by passing the URL to requests.get(). Then create a parse tree (a soup object) by passing that HTML document and Python's built-in HTML parser to BeautifulSoup().
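Those steps could look roughly like this, assuming Python 3 with the requests and bs4 packages installed. The function names get_html and extract_hrefs and the 10-second timeout are illustrative choices, not taken from the text above.

```python
import requests
from bs4 import BeautifulSoup


def get_html(url, timeout=10):
    """Fetch a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_hrefs(html):
    """Parse HTML with the built-in parser and return all href values."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


# Usage (performs a network request):
#     for href in extract_hrefs(get_html("http://www.yourwebsite.com")):
#         print(href)
```

Keeping the download and the parsing in separate functions makes it possible to try the parsing step on a saved HTML string without touching the network.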


1 Answer

Try Beautiful Soup:

from BeautifulSoup import BeautifulSoup
import urllib2
import re
html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

In case you just want links starting with http://, you should use:

soup.findAll('a', attrs={'href': re.compile("^http://")}) 
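Note that the pattern ^http:// skips https:// links. A hedged variant that accepts both schemes, sketched for Python 3 with bs4 (the sample HTML is made up for illustration):

```python
import re
from bs4 import BeautifulSoup

# Illustrative HTML mixing absolute, relative, and mailto links
html = (
    '<a href="http://a.example/">1</a>'
    '<a href="https://b.example/">2</a>'
    '<a href="/local">3</a>'
    '<a href="mailto:someone@example.com">4</a>'
)

soup = BeautifulSoup(html, "html.parser")

# "s?" makes the trailing "s" optional, so http:// and https:// both match
absolute = [a["href"] for a in soup.find_all("a", href=re.compile(r"^https?://"))]
print(absolute)
```

Relative links such as /local and other schemes such as mailto: are filtered out by the pattern.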

In Python 3 with BS4 it should be:

from bs4 import BeautifulSoup
import urllib.request
html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a'):
    print(link.get('href'))
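If installing Beautiful Soup is not an option, a similar result can be sketched with only the Python 3 standard library. This swaps in html.parser.HTMLParser instead of Beautiful Soup; the class name LinkCollector, the function collect_links, and the base_url parameter are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (illustrative helper)."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, value))


def collect_links(html, base_url=""):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, collect_links(html, "http://www.yourwebsite.com") turns a relative href such as /docs into http://www.yourwebsite.com/docs and leaves absolute URLs unchanged.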
systempuntoout answered Sep 18 '22 15:09