Extracting href from <a> with Beautiful Soup

Tags:

beautifulsoup

I'm trying to extract a link from a Google search result. Inspect element tells me that the section I am interested in has class="r". The first result looks like this:

<h3 class="r" original_target="https://en.wikipedia.org/wiki/chocolate" style="display: inline-block;">
    <a href="https://en.wikipedia.org/wiki/Chocolate" 
       ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://en.wikipedia.org/wiki/Chocolate&amp;ved=0ahUKEwjW6tTC8LXZAhXDjpQKHSXSClIQFgheMAM" 
       saprocessedanchor="true">
        Chocolate - Wikipedia
    </a>
</h3>

To extract the "href" I do:

import bs4, requests
res = requests.get('https://www.google.com/search?q=chocolate')
googleSoup = bs4.BeautifulSoup(res.text, "html.parser")
elements= googleSoup.select(".r a")
elements[0].get("href")

But I unexpectedly get:

'/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd'

Where I wanted:

"https://en.wikipedia.org/wiki/Chocolate"

The attribute "ping" seems to be confusing it. Any ideas?


asked Feb 21 '18 03:02

1 Answer

What's happening?

If you print the response content (i.e. res.text), you'll see that you're getting completely different HTML. The page source you see in the browser and the response content your script receives don't match.

This is not a case of dynamically loaded content: with dynamic loading, the page source and the response content would still be the same, and only the HTML you see while inspecting the element would differ.

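The basic explanation is that Google recognizes that the request comes from a Python script rather than a browser and changes its response. By default, requests identifies itself with a User-Agent of the form python-requests/<version>.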
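To see what the script announces itself as, you can print the default User-Agent that requests uses. This is only a small sketch for inspection; it makes no network request, and the exact version string depends on the installed release.

```python
import requests

# The User-Agent that requests sends when none is supplied.
default_agent = requests.utils.default_user_agent()
print(default_agent)

# A Session starts with the same default; headers passed per request override it.
session = requests.Session()
print(session.headers["User-Agent"])
```

Any server that inspects this header can tell the request did not come from a browser.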

Solution:

You can pass a fake User-Agent to make the script look like a real browser request.

Code:

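import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent so the request looks like it comes from Chrome
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

r = requests.get('https://www.google.co.in/search?q=chocolate', headers=headers)
# 'lxml' requires the lxml package; 'html.parser' also works
soup = BeautifulSoup(r.text, 'lxml')

# Select every <a> inside an element with class "r" and print the first link
elements = soup.select('.r a')
print(elements[0]['href'])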

Output:

https://en.wikipedia.org/wiki/Chocolate
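If the response still contains redirect-style links like the one in the question, another option is to pull the target out of the link itself. The following is a minimal sketch, assuming the link has the form /url?q=<target>&... as in the question's output; unwrap_google_redirect is a hypothetical helper name, not part of requests or Beautiful Soup.

```python
from urllib.parse import urlparse, parse_qs


def unwrap_google_redirect(href):
    """Return the target URL from a '/url?q=...' link, or href unchanged."""
    parsed = urlparse(href)
    if parsed.path == "/url":
        # parse_qs maps each query key to a list of decoded values.
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href


# The href from the question, used here as sample input.
wrapped = "/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd"
print(unwrap_google_redirect(wrapped))
```

Links that are not in the redirect form are returned unchanged, so the helper can be applied to every href from the select() result.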

Resources:

Sending “User-agent” using Requests library in Python
How to use Python requests to fake a browser visit?
Using headers with the Python requests library's get method

answered Sep 22 '22 12:09

Keyur Potdar
