Reconstructing absolute URLs from relative URLs on a page

Tags: python, html, url-parsing

Given the absolute URL of a page and a relative link found within that page, is there a way to (a) definitively reconstruct or (b) make a best-effort reconstruction of the absolute URL that the relative link points to?

In my case, I'm reading an HTML file from a given URL using Beautiful Soup, pulling out the src of every img tag, and trying to construct a list of absolute URLs for the page's images.

My Python function so far looks like:

def get_image_url(page_url, image_src):
    from urlparse import urlparse  # Python 2; in Python 3 this is urllib.parse
    # parsed = urlparse('http://user:pass@NetLoc:80/path;parameters?query=argument#fragment')
    parsed = urlparse(page_url)
    url_base = parsed.netloc
    url_path = parsed.path

    if image_src.find('http') == 0:
        # It's an absolute URL, do nothing.
        pass
    elif image_src.find('/') == 0:
        # If it's a root URL, append it to the base URL:
        image_src = 'http://' + url_base + image_src
    else:
        # If it's a relative URL, ?
        pass
    return image_src

NOTE: I don't need a Python answer, just the logic required.

asked Mar 15 '12 11:03

Yarin

2 Answers

Very simple:

>>> from urlparse import urljoin
>>> urljoin('http://mysite.com/foo/bar/x.html', '../../images/img.png')
'http://mysite.com/images/img.png'
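In Python 3 the same function lives in urllib.parse. As a rough sketch of how it covers each branch of the function in the question (the helper name absolute_image_urls and the example hosts are made up for illustration):

```python
from urllib.parse import urljoin  # Python 3 home of urljoin


def absolute_image_urls(page_url, image_srcs):
    """Resolve every src value against the URL of the page it came from."""
    return [urljoin(page_url, src) for src in image_srcs]


page_url = "http://mysite.com/foo/bar/x.html"  # hypothetical page
srcs = [
    "http://other.example/a.png",  # already absolute: left unchanged
    "/images/img.png",             # root-relative: replaces the whole path
    "images/img.png",              # relative to the page's directory
    "../../images/img.png",        # parent references are collapsed
    "//cdn.example/img.png",       # scheme-relative: inherits only "http"
]
for url in absolute_image_urls(page_url, srcs):
    print(url)
```

So there should be no need to branch on the shape of the src yourself; urljoin applies the standard resolution rules for all of these cases.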

answered Sep 28 '22 07:09

Not_a_Golfer

Use urllib.parse.urljoin to resolve a (possibly relative) URL against a base URL.

However, the base URL of a web page isn't necessarily the URL you fetched the document from, because HTML lets a page declare its preferred base URL via the BASE element. The logic you need is as follows:

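import urllib.parse

# Assumes `page_url` is the URL the page was fetched from and `document`
# is a DOM-style parsed document (for example from xml.dom.minidom).
base_url = page_url
head = document.getElementsByTagName('head')[0]
for base in head.getElementsByTagName('base'):
    if base.hasAttribute('href'):
        # The href may itself be relative, so resolve it against the page URL.
        base_url = urllib.parse.urljoin(base_url, base.getAttribute('href'))
        # HTML5 4.2.3 "if there are multiple base elements with href
        # attributes, all but the first are ignored."
        break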

(If you are parsing XHTML then in theory you ought to take into account the rather hairy XML Base specification instead. But you can probably get away without worrying about that, since no-one really uses XHTML.)
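For the image-list use case in the question, here is a minimal end-to-end sketch using only the standard library's html.parser rather than Beautiful Soup. The names ImageCollector and image_urls and the sample markup are invented for illustration, and for brevity it accepts a base element anywhere in the document instead of only inside head:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect img src values and the first <base href> seen in the document."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base_href is None and attrs.get("href"):
            # Only the first base element with an href counts.
            self.base_href = attrs["href"]
        elif tag == "img" and attrs.get("src"):
            self.srcs.append(attrs["src"])


def image_urls(page_url, html):
    """Return absolute URLs for every img src found in the html string."""
    collector = ImageCollector()
    collector.feed(html)
    base_url = page_url
    if collector.base_href is not None:
        # The base href may itself be relative to the page URL.
        base_url = urljoin(page_url, collector.base_href)
    return [urljoin(base_url, src) for src in collector.srcs]


sample = (
    '<html><head><base href="/static/"></head><body>'
    '<img src="logo.png"><img src="/favicon.ico">'
    '<img src="http://other.example/a.png"/></body></html>'
)
for url in image_urls("http://mysite.com/foo/bar/x.html", sample):
    print(url)
```

Collecting the src values first and resolving them only after the whole document has been parsed means the result does not depend on where the base element appears relative to the images.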

answered Sep 28 '22 08:09

Gareth Rees
