Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

extracting href from <a> beautiful soup

I'm trying to extract a link from a google search result. Inspect element tells me that the section I am interested in has "class = r". The first result looks like this:

<h3 class="r" original_target="https://en.wikipedia.org/wiki/chocolate" style="display: inline-block;">
    <a href="https://en.wikipedia.org/wiki/Chocolate" 
       ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://en.wikipedia.org/wiki/Chocolate&amp;ved=0ahUKEwjW6tTC8LXZAhXDjpQKHSXSClIQFgheMAM" 
       saprocessedanchor="true">
        Chocolate - Wikipedia
    </a>
</h3>

To extract the "href" I do:

import bs4, requests
res = requests.get('https://www.google.com/search?q=chocolate')
googleSoup = bs4.BeautifulSoup(res.text, "html.parser")
elements= googleSoup.select(".r a")
elements[0].get("href")

But I unexpectedly get:

'/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd'

Where I wanted:

"https://en.wikipedia.org/wiki/Chocolate"

The attribute "ping" seems to be confusing it. Any ideas?

like image 300
GlaceCelery Avatar asked Feb 21 '18 03:02

GlaceCelery


People also ask

What is href BeautifulSoup?

The 'BeautifulSoup' function is used to extract text from the webpage. The 'find_all' function is used to extract text from the webpage data. The href links are printed on the console.


1 Answers

What's happening?

If you print the response content (i.e. googleSoup.text) you'll see that you're getting a completely different HTML. The page source and the response content don't match.

This is not happening because the content is loaded dynamically; as even then, the page source and the response content are the same. (But the HTML you see while inspecting the element is different.)

A basic explanation for this is that Google recognizes the Python script and changes its response.

Solution:

You can pass a fake User-Agent to make the script look like a real browser request.


Code:

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

r = requests.get('https://www.google.co.in/search?q=chocolate', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

elements = soup.select('.r a')
print(elements[0]['href'])

Output:

https://en.wikipedia.org/wiki/Chocolate

Resources:

like image 97
Keyur Potdar Avatar answered Sep 22 '22 12:09

Keyur Potdar