I'm new to Web scraping and just started using BeautifulSoup. Here's my question.
When you look up a word in Google in such a way using a search query like "define:lucid", in most cases, a panel showing the meaning and the pronunciation appears at the front page. (Shown in the left side of the embedded image)
[Google default dictionary example]
Things I want to scrape and collect automatically are the text of the meaning and the URL in which the mp3 data of the pronunciation is stored. Using the Chrome Inspector manually, these are easily found in its "Elements" section, e.g., the Inspector (shown in the right side of the image) shows the URL, which stores the mp3 data of the pronunciation of "lucid" (here).
However, using requests to get the content of the HTML of the search result and parsing it with BeautifulSoup, like the code below, the soup
gets only a few of contents in the panel such as the IPA "/ˈluːsɪd/" and the attribute "adjective" like the result below, and none of the contents I need can be found, such as things in audio elements.
How can I get the information with BeautifulSoup if possible, otherwise what alternative tools are suitable for this task?
P.S. I think the quality of pronunciation from Google dictionary is better than ones from any other dictionary sites. So I want to stick to it.
Code:
import requests
from bs4 import BeautifulSoup
query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query
r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
Part of soup
content:
</span>
<span style="font:smaller 'Doulos SIL','Gentum','TITUS Cyberbit Basic','Junicode','Aborigonal Serif','Arial Unicode MS','Lucida Sans Unicode','Chrysanthi Unicode';padding-left:15px">
/ˈluːsɪd/
</span>
</div>
</h3>
<table style="font-size:14px;width:100%">
<tr>
<td>
<div style="color:#666;padding:5px 0">
adjective
</div>
The basic request you run is not returning the parts of the page rendered via JavaScript. If you right-click in Chrome and select View Page Source the audio link is not there. Solution: you could render the page via selenium
. With the below code I get the <audio>
tag including the link.
You'll have to pip install selenium
, download ChromeDriver and add the folder containing it to PATH
like export PATH=$PATH:~/downloads/
import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver
def render_page(url):
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)
r = driver.page_source
#driver.quit()
return r
query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query
r = render_page(goog_search)
soup = BeautifulSoup(r, "html.parser")
print(soup.prettify())
I checked it. You're right, in the BeautifulSoup output there is no audio elements for some reason. However, having inspected the code, I found a source for the audio file which Google is using, which is http://ssl.gstatic.com/dictionary/static/sounds/oxford/lucid--_gb_1.mp3 and which perfectly works if you substitute "lucid" with any other word.
So, if you need to scrape the audio file, you could just do the following:
url='http://ssl.gstatic.com/dictionary/static/sounds/oxford/'
audio=requests.get(url+'lucid'+'--_gb_1.mp3', stream=True).content
with open('lucid'+'.mp3', 'wb') as f:
f.write(audio)
As for other elements, I'm afraid you'll need just to find the word "definition" in the soup and scrape the content of the tag that contains it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With