Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reconstructing absolute urls from relative urls on a page

Given an absolute url of a page, and a relative link found within that page, would there be a way to a) definitively reconstruct or b) best-effort reconstruct the absolute url of the relative link?

In my case, I'm reading an html file from a given url using beautiful soup, stripping out all the img tag sources, and trying to construct a list of absolute urls to the page images.

My Python function so far looks like:

function get_image_url(page_url,image_src):

    from urlparse import urlparse
    # parsed = urlparse('http://user:pass@NetLoc:80/path;parameters?query=argument#fragment')
    parsed = urlparse(page_url)
    url_base = parsed.netloc
    url_path = parsed.path

    if src.find('http') == 0:
        # It's an absolute URL, do nothing.
        pass
    elif src.find('/') == 0:
        # If it's a root URL, append it to the base URL:
        src = 'http://' + url_base + src
    else:
        # If it's a relative URL, ?

NOTE: Don't need a Python answer, just the logic required.

like image 262
Yarin Avatar asked Mar 15 '12 11:03

Yarin


People also ask

How do you make an absolute URL?

If you prefix the URL with // it will be treated as an absolute one. For example: <a href="//google.com">Google</a> . Keep in mind this will use the same protocol the page is being served with (e.g. if your page's URL is https://path/to/page the resulting URL will be https://google.com ).

What is absolute URL and relative URL in HTML?

An absolute URL contains all the information necessary to locate a resource. A relative URL locates a resource using an absolute URL as a starting point. In effect, the "complete URL" of the target is specified by concatenating the absolute and relative URLs.

What do you mean by URL give one example of relative and one example of absolute URL?

For example, http://sparkpay.com is an absolute URL. Relative URL - A relative URL typically contains only the path to a specific file. In context to the AmeriCommerce online stores system, these typically begin with a forward slash. The forward slash basically tells the browser to go to the domain of the site.


2 Answers

very simple:

>>> from urlparse import urljoin >>> urljoin('http://mysite.com/foo/bar/x.html', '../../images/img.png') 'http://mysite.com/images/img.png' 
like image 83
Not_a_Golfer Avatar answered Sep 28 '22 07:09

Not_a_Golfer


Use urllib.parse.urljoin to resolve a (possibly relative) URL against a base URL.

But, the base URL of a web page isn't necessarily the same as the URL you fetched the document from, because HTML allows a page to specify its preferred base URL via the BASE element. The logic you need is as follows:

base_url = page_url
head = document.getElementsByTagName('head')[0]
for base in head.getElementsByTagName('base'):
    if base.hasAttribute('href'):
        base_url = urllib.parse.urljoin(base_url, base.getAttribute('href'))
        # HTML5 4.2.3 "if there are multiple base elements with href
        # attributes, all but the first are ignored."
        break

(If you are parsing XHTML then in theory you ought to take into account the rather hairy XML Base specification instead. But you can probably get away without worrying about that, since no-one really uses XHTML.)

like image 32
Gareth Rees Avatar answered Sep 28 '22 08:09

Gareth Rees