Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using Beautiful Soup to get the full URL in source code

Tags:

python

So I was looking at some source code and I came across this bit of code

<img src="/gallery/2012-winners-finalists/HM_Watching%20birds2_Shane%20Conklin_MA_2012.jpg"

now in the source code the link is blue and when you click it, it takes you to the full URL where that picture is located, I know how to get what is shown in the source code in Python using Beautiful Soup I was wondering though how to get the full URL you get once clicking the link in the source code?

EDIT: if I was given <a href = "/folder/big/a.jpg" how do you figure out the starting part of that url through python or beautiful soup?

like image 329
user2476540 Avatar asked Jul 31 '13 13:07

user2476540


People also ask

How do you get a URL on BeautifulSoup?

Use the a tag to extract the links from the BeautifulSoup object. Get the actual URLs from the form all anchor tag objects with get() method and passing href argument to it. Moreover, you can get the title of the URLs with get() method and passing title argument to it.

Which method in BeautifulSoup is used to check all URL or images?

Method 1: Using descendants and find() First, import the required modules, then provide the URL and create its requests object that will be parsed by the beautifulsoup object. Now with the help of find() function in beautifulsoup we will find the <body> and its corresponding <ul> tags.


1 Answers

<a href="/folder/big/a.jpg">

That’s an absolute address for the current host. So if the HTML file is at http://example.com/foo/bar.html, then applying the url /folder/big/a.jpg will result in this:

http://example.com/folder/big/a.jpg

I.e. take the host name and apply the new path to it.

Python has the builtin urljoin function to perform this operation for you:

>>> from urllib.parse import urljoin
>>> base = 'http://example.com/foo/bar.html'
>>> href = '/folder/big/a.jpg'
>>> urljoin(base, href)
'http://example.com/folder/big/a.jpg'

For Python 2, the function is within the urlparse module.

like image 85
poke Avatar answered Oct 20 '22 15:10

poke