A Google search gives me the following first result as HTML:
<h3 class="r"><a href="https://rads.stackoverflow.com/amzn/click/com/0470284889" rel="nofollow noreferrer" class="l vst" onmousedown="return rwt(this,'','','','1','AFQjCNEv1W9YC2jcSKYdEo2kNqBMJ-Utmg','k89K9hF4cVNpxQYHtEKiUQ','0CCoQFjAA',null,event)"><em>Quantitative Trading</em>: <em>How to Build Your Own Algorithmic</em> <b>...</b> - Amazon</a></h3>
I would like to extract the link http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889 from this, but when I use Beautiful Soup to extract it with
soup.find("h3").find("a").get("href")
I get the following string instead:
/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGeA7A
I know that the link is in there and I could parse it by deleting the /url?q= and everything after the & symbol, but I was wondering if there was a cleaner solution.
Thanks!
You can use a combination of urlparse.urlparse and urlparse.parse_qs, e.g.:
>>> import urlparse
>>> url = '/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'
>>> data = urlparse.parse_qs(
... urlparse.urlparse(url).query
... )
>>> data
{'ei': ['P2ycT6OoNuasiAL2ncV5'],
'q': ['http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889'],
'sa': ['U'],
'usg': ['AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'],
'ved': ['0CBIQFjAA']}
>>> data['q'][0]
'http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889'
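Note that the urlparse module above is Python 2; on Python 3 the same functions live in urllib.parse, so a minimal equivalent sketch would be:
from urllib.parse import urlparse, parse_qs

url = '/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'

# parse_qs() returns a dict mapping each query parameter to a list of values
data = parse_qs(urlparse(url).query)
print(data['q'][0])
# http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889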
To extract only the first result from the page, you can use the bs4 select_one() method with a CSS selector, or find() (see the sketch after the output below).
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

# passing parameters in URLs
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {'q': 'Quantitative Trading How to Build Your Own Algorithmic - amazon'}

def bs4_get_first_googlesearch():
    html = requests.get('https://www.google.com/search', headers=headers, params=params).text
    soup = BeautifulSoup(html, 'lxml')

    # the first organic result is wrapped in a container with class "yuRUbf"
    first_link = soup.select_one('.yuRUbf').a['href']
    print(first_link)

bs4_get_first_googlesearch()
# output:
'''
https://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889
'''
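If you prefer find() over a CSS selector, an equivalent lookup would be the sketch below. It assumes the same .yuRUbf container class as above, which Google changes periodically:
# equivalent lookup with find(); the "yuRUbf" class matches the selector
# used above and may change as Google updates its markup
first_link = soup.find('div', class_='yuRUbf').a['href']
print(first_link)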
Alternatively, you can do the same thing using the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches. Check out the playground.
The big difference is that everything is already done for the end user: selecting elements, bypassing blocks, rotating proxies, and more.
Code to integrate:
from serpapi import GoogleSearch
import os

def serpapi_get_first_googlesearch():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "Quantitative Trading How to Build Your Own Algorithmic - amazon",
        "hl": "en",
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    # [0] - first element from the search results
    first_link = results['organic_results'][0]['link']
    print(first_link)

serpapi_get_first_googlesearch()
# output:
'''
https://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889
'''
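If you need more than the first hit, the same results dict can be iterated; a minimal sketch, assuming each entry in organic_results carries a 'link' key as in the code above:
# print every organic result link instead of only the first one
for result in results['organic_results']:
    print(result['link'])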
Disclaimer, I work for SerpApi.