So I was looking at some source code and came across this bit of code:
<img src="/gallery/2012-winners-finalists/HM_Watching%20birds2_Shane%20Conklin_MA_2012.jpg"
In the browser's source view the link is blue, and when you click it, it takes you to the full URL where that picture is located. I know how to get what is shown in the source code in Python using Beautiful Soup. What I am wondering is how to get the full URL that you reach by clicking the link in the source code.
EDIT:
If I was given <a href = "/folder/big/a.jpg", how do you figure out the starting part of that URL through Python or Beautiful Soup?
Use the a tag to extract the links from the BeautifulSoup object. Get the actual URLs from all of the anchor tag objects by calling the get() method with 'href' as the argument. Likewise, you can get each link's title attribute by calling get() with 'title'.
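As a minimal sketch of the get() calls described above: the HTML snippet and the title text below are made up for illustration, and the stdlib "html.parser" backend is just one possible parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this example.
html = """
<a href="/folder/big/a.jpg" title="Big picture">big</a>
<a href="/folder/small/a.jpg">small</a>
"""

soup = BeautifulSoup(html, "html.parser")

# get() returns None when an attribute is missing instead of raising,
# so the second link simply has no title.
links = [(a.get("href"), a.get("title")) for a in soup.find_all("a")]
print(links)
```

Note that the href values come back exactly as written in the HTML, so they are still missing the scheme and host.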
Method 1: Using descendants and find()
First, import the required modules, then provide the URL and fetch it with requests so that the response can be parsed by BeautifulSoup. Now, with the help of BeautifulSoup's find() function, we will find the <body> tag and its corresponding <ul> tags.
<a href="/folder/big/a.jpg">
That’s an absolute path on the current host. So if the HTML file is at http://example.com/foo/bar.html, then applying the URL /folder/big/a.jpg will result in this:
http://example.com/folder/big/a.jpg
In other words, keep the scheme and host name of the page and replace its path with the new one.
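To make that rule concrete, here is a rough sketch of doing it by hand. The helper name resolve_host_relative is hypothetical, and it deliberately handles only paths that start with a single slash.

```python
from urllib.parse import urlsplit, urlunsplit

def resolve_host_relative(page_url, path):
    """Keep the scheme and host of page_url and swap in the new path."""
    # Only host-relative paths such as "/folder/big/a.jpg" are handled here;
    # "//host/..." and plain relative paths follow different rules.
    if not path.startswith("/") or path.startswith("//"):
        raise ValueError("this sketch only handles paths starting with a single '/'")
    parts = urlsplit(page_url)
    # Drop the old path, query and fragment; keep scheme and host.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))

full_url = resolve_host_relative("http://example.com/foo/bar.html", "/folder/big/a.jpg")
print(full_url)  # http://example.com/folder/big/a.jpg
```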
Python's standard library has the urljoin function to perform this operation for you:
>>> from urllib.parse import urljoin
>>> base = 'http://example.com/foo/bar.html'
>>> href = '/folder/big/a.jpg'
>>> urljoin(base, href)
'http://example.com/folder/big/a.jpg'
For Python 2, the function is in the urlparse module.
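Putting the two pieces together with Beautiful Soup might look like the sketch below. The page URL is an assumption, since the question does not say which site the HTML came from, so example.com stands in for it. The tags are the ones from the question, closed so that the snippet parses cleanly.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumption: the address the HTML was downloaded from (placeholder host).
page_url = "http://example.com/foo/bar.html"

html = """
<img src="/gallery/2012-winners-finalists/HM_Watching%20birds2_Shane%20Conklin_MA_2012.jpg">
<a href="/folder/big/a.jpg">big</a>
"""

soup = BeautifulSoup(html, "html.parser")

full_urls = []
for tag in soup.find_all(["img", "a"]):
    # Images carry their address in src, anchors in href.
    raw = tag.get("src") or tag.get("href")
    if raw:
        # Resolve the address against the page it was found on.
        full_urls.append(urljoin(page_url, raw))

for url in full_urls:
    print(url)
```

In a real script, page_url would be whatever URL you requested to get the HTML in the first place, which is where the "starting part" of the address comes from.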