So lxml has a very handy feature, make_links_absolute:
import lxml.html

doc = lxml.html.fromstring(some_html_page)
doc.make_links_absolute(url_for_some_html_page)
and all the links in doc are now absolute. Is there an easy equivalent in BeautifulSoup, or do I simply need to pass each href through urlparse and normalize it myself:
from urllib.parse import urlparse

soup = BeautifulSoup(some_html_page)
for tag in soup.findAll('a', href=True):
    url_data = urlparse(tag['href'])
    if url_data[0] == "":
        full_url = url_for_some_html_page + tag['href']
In my answer to "What is a simple way to extract the list of URLs on a webpage using Python?" I covered that incidentally as part of the extraction step; you could easily write a method to do it on the soup itself rather than just extracting the URLs.
from urllib.parse import urljoin

def make_links_absolute(soup, url):
    for tag in soup.findAll('a', href=True):
        tag['href'] = urljoin(url, tag['href'])
(On Python 2, use from urlparse import urljoin instead.)
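For illustration, here is a minimal usage sketch; the sample HTML, the base URL https://example.com/page, and the printed output are made up for this example, and the helper is the same one defined above:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def make_links_absolute(soup, url):
    # Rewrite every href in place; urljoin leaves already-absolute URLs untouched.
    for tag in soup.findAll('a', href=True):
        tag['href'] = urljoin(url, tag['href'])

# Hypothetical page with one relative and one already-absolute link.
html = '<a href="/about">About</a> <a href="https://other.example/x">Other</a>'
soup = BeautifulSoup(html, 'html.parser')

make_links_absolute(soup, 'https://example.com/page')
print([a['href'] for a in soup.find_all('a')])
# ['https://example.com/about', 'https://other.example/x']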