Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BeautifulSoup does not return all elements on page

I'm new to Web scraping and just started using BeautifulSoup. Here's my question.

When you look up a word in Google in such a way using a search query like "define:lucid", in most cases, a panel showing the meaning and the pronunciation appears at the front page. (Shown in the left side of the embedded image)

[Google default dictionary example]

enter image description here

Things I want to scrape and collect automatically are the text of the meaning and the URL in which the mp3 data of the pronunciation is stored. Using the Chrome Inspector manually, these are easily found in its "Elements" section, e.g., the Inspector (shown in the right side of the image) shows the URL, which stores the mp3 data of the pronunciation of "lucid" (here).

However, using requests to get the content of the HTML of the search result and parsing it with BeautifulSoup, like the code below, the soup gets only a few of contents in the panel such as the IPA "/ˈluːsɪd/" and the attribute "adjective" like the result below, and none of the contents I need can be found, such as things in audio elements.

How can I get the information with BeautifulSoup if possible, otherwise what alternative tools are suitable for this task?

P.S. I think the quality of pronunciation from Google dictionary is better than ones from any other dictionary sites. So I want to stick to it.

Code:

import requests
from bs4 import BeautifulSoup

query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query

r = requests.get(goog_search)

soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())

Part of soup content:

           </span>
           <span style="font:smaller 'Doulos SIL','Gentum','TITUS Cyberbit Basic','Junicode','Aborigonal Serif','Arial Unicode MS','Lucida Sans Unicode','Chrysanthi Unicode';padding-left:15px">
            /ˈluːsɪd/
           </span>
          </div>
         </h3>
         <table style="font-size:14px;width:100%">
          <tr>
           <td>
            <div style="color:#666;padding:5px 0">
             adjective
            </div>
like image 427
Kohki Mametani Avatar asked Mar 09 '23 03:03

Kohki Mametani


2 Answers

The basic request you run is not returning the parts of the page rendered via JavaScript. If you right-click in Chrome and select View Page Source the audio link is not there. Solution: you could render the page via selenium. With the below code I get the <audio> tag including the link.

You'll have to pip install selenium, download ChromeDriver and add the folder containing it to PATH like export PATH=$PATH:~/downloads/

import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver

def render_page(url):
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(3)
    r = driver.page_source
    #driver.quit()
    return r

query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query

r = render_page(goog_search)

soup = BeautifulSoup(r, "html.parser")
print(soup.prettify())
like image 76
M3RS Avatar answered Mar 16 '23 11:03

M3RS


I checked it. You're right, in the BeautifulSoup output there is no audio elements for some reason. However, having inspected the code, I found a source for the audio file which Google is using, which is http://ssl.gstatic.com/dictionary/static/sounds/oxford/lucid--_gb_1.mp3 and which perfectly works if you substitute "lucid" with any other word.

So, if you need to scrape the audio file, you could just do the following:

url='http://ssl.gstatic.com/dictionary/static/sounds/oxford/'    
audio=requests.get(url+'lucid'+'--_gb_1.mp3', stream=True).content
with open('lucid'+'.mp3', 'wb') as f:
     f.write(audio)

As for other elements, I'm afraid you'll need just to find the word "definition" in the soup and scrape the content of the tag that contains it.

like image 36
Alexey Gridnev Avatar answered Mar 16 '23 11:03

Alexey Gridnev