
Getting data from hidden html (popup) using BS4

I am trying to scrape the name of a link shown in a popup on Wikipedia. When you hover over a link on Wikipedia, a popup appears with a short snippet from the intro of the linked article. I need to scrape that information, but I am unsure where it would be in the source. When I inspect the element (while the popup is showing), this is the HTML (for this example I am hovering over the link "Greek"):

<a dir="ltr" lang="en" class="mwe-popups-extract" href="/wiki/Ancient_Greek"> 
<p>The <b>Ancient Greek</b> language includes the forms of Greek...(a bunch more text)...</p></a> 

What I need to extract is the href, "/wiki/Ancient_Greek", but this piece of HTML disappears when I am not hovering over the link. Is there a way (with BS4 and Python) to extract this information from the source HTML I am scraping?

EDIT: I can't afford to make additional calls to web pages because the project already takes a long time to run. If there is any way to change how I retrieve the source so that I can get the popup information, that would be helpful. This project is giant and getting this popup information is crucial.

Any suggestions at all that don't require a complete rebuild of the project are greatly appreciated. I am using urllib (with requests) to pull the source and bs4 to scrape through it.

Pookie asked Jul 17 '18 13:07


2 Answers

In your question you say that you "...can't afford to make additional calls to webpages...", but that is exactly what your browser is doing behind the scenes. The HTML for the page you are looking at doesn't contain the content that you require.

To demonstrate this:

  1. In your browser, open a Wikipedia page such as Greek.

  2. Bring up the Developer Tools window (Ctrl+Shift+I in Chrome).

  3. Click on the Network tab and make sure that the red record button is lit so that all web requests are logged.

  4. Hover over a link in the page such as Ancient Greek.

    You will see that the act of hovering over the link triggers a GET request to the Ancient_Greek summary page.

  5. Click on "Ancient_Greek" in the network tab log to show details of the request.

  6. Click on the Response tab on the right.

    You should see the JSON response containing a field called "extract_html" containing the content you require: "<p>The <b>Ancient Greek</b> language includes the forms...

Therefore, to get the information you need, every time you encounter a link such as <a href="/wiki/something">, you will have to make a GET request to https://en.wikipedia.org/api/rest_v1/page/summary/something
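As a rough sketch of that request in Python, using only the standard library: the helper names (summary_url, extract_from_summary, fetch_extract_html) and the User-Agent string are made up for illustration, and the only response field relied on is the "extract_html" field seen in the Network tab above.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the request the browser makes on hover.
SUMMARY_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/page/summary/"


def summary_url(href):
    """Turn an article href such as '/wiki/Ancient_Greek' into its summary URL."""
    title = href.split("/wiki/", 1)[1]   # keep everything after '/wiki/'
    title = title.split("#", 1)[0]       # drop a '#section' fragment, if any
    # Normalise the percent-encoding; safe="" also encodes '/' inside titles.
    return SUMMARY_ENDPOINT + urllib.parse.quote(urllib.parse.unquote(title), safe="")


def extract_from_summary(data):
    """Return the popup HTML from a parsed summary response, or None if absent."""
    return data.get("extract_html")


def fetch_extract_html(href, timeout=10):
    """Make the GET request for one link and return its 'extract_html' field."""
    request = urllib.request.Request(
        summary_url(href),
        # Hypothetical client name; replace with something identifying your project.
        headers={"User-Agent": "popup-scraper-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_from_summary(json.load(response))
```

Calling fetch_extract_html("/wiki/Ancient_Greek") makes one network request per link, so this is the extra call you were hoping to avoid. The returned HTML fragment can then be handed to BS4 like any other markup.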

stx101 answered Sep 20 '22 11:09


With popups and other data that only appear dynamically via JavaScript, you can't just scrape the data using something like urllib, because urllib only downloads the static HTML and never runs the page's scripts.

You could use a browser controller like splinter or selenium which will allow you to automatically hover over or click things to bring up the popup and then extract its data. After you get the popup html, you can use BS4 to clean it up.

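Example (a generic splinter session; the URL and the element name 'button1' are placeholders):

from splinter import Browser

browser = Browser()                       # start a browser with splinter's default driver
browser.visit("http://google.com")        # load the page
button = browser.find_by_name('button1')  # look up elements by their name attribute
button.click()                            # click the first matching element
browser.quit()                            # close the browser when finished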

jstein123 answered Sep 18 '22 11:09