Scraping data from interactive graph

There is a website with a couple of interactive charts from which I would like to extract data. I've written a couple of web scrapers before in Python using Selenium WebDriver, but this seems to be a different problem. I've looked at a couple of similar questions on Stack Overflow. From those it seems that the solution could be to download the data directly from a JSON file. I looked at the source code of the website and identified a couple of JSON files, but upon inspection they don't seem to contain the data.

Does anyone know how to download the data from those graphs? In particular I am interested in this bar chart: .//*[@id='network_download']

Thanks

Edit: I should add that when I inspected the website using Firebug I saw that it is possible to get data in the following format. But this is obviously not helpful, as it doesn't include any labels.

<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;">
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;">
<circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5" style="opacity: 0.983087;">
<circle fill="#B0B359" cx="613.88416" cy="21.42857142857143" r="4.5" style="opacity: 0.983087;">
<circle fill="#D1D49E" cx="602.917665" cy="26.785714285714285" r="4.5" style="opacity: 0.983087;">
<circle fill="#A5E0B5" cx="581.5437366666666" cy="32.142857142857146" r="4.5" style="opacity: 0.983087;">
jonasus asked Sep 19 '16 10:09



1 Answer

SVG charts like this tend to be a bit tough to scrape. The numbers you want aren't displayed until you actually hover over the individual elements with your mouse.

To get the data you need to:

  1. Find a list of all dots
  2. For each dot in dots_list, click the dot or hover over it (e.g. with ActionChains)
  3. Scrape the values in the tooltip that pops up

This works for me:

from __future__ import print_function

from pprint import pprint as pp

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


def main():
    driver = webdriver.Chrome()
    ac = ActionChains(driver)

    try:
        driver.get("https://opensignal.com/reports/2016/02/state-of-lte-q4-2015/")

        # find_element(By.CSS_SELECTOR, ...) is used throughout because the
        # find_element_by_* helpers were removed in Selenium 4.3.
        dots_css = "div#network_download g g.dots_container circle"
        dots_list = driver.find_elements(By.CSS_SELECTOR, dots_css)

        print("Found {0} data points".format(len(dots_list)))

        download_speeds = list()
        for index, _ in enumerate(dots_list, 1):
            # Because this is an SVG chart, and because we need to hover it,
            # it is very likely that the elements will go stale as we do this. For
            # that reason we need to re-query each dot element right before we click it
            single_dot_css = dots_css + ":nth-child({0})".format(index)
            dot = driver.find_element(By.CSS_SELECTOR, single_dot_css)
            dot.click()

            # Scrape the text from the popup
            popup_css = "div#network_download div.tooltip"
            popup_text = driver.find_element(By.CSS_SELECTOR, popup_css).text
            pp(popup_text)

            # The tooltip has three lines: rank, "<company> in <country>", speed
            rank, comp_and_country, speed = popup_text.split("\n")
            company, country = comp_and_country.split(" in ")
            speed_dict = {
                "rank": rank.split(" Globally")[0].strip("#"),
                "company": company,
                "country": country,
                "speed": speed.split("Download speed: ")[1]
            }
            download_speeds.append(speed_dict)

            # Hover away from the tool tip so it clears
            hover_elem = driver.find_element(By.ID, "network_download")
            ac.move_to_element(hover_elem).perform()

        pp(download_speeds)

    finally:
        driver.quit()

if __name__ == "__main__":
    main()

Sample Output:

(.venv35) ➜  stackoverflow python svg_charts.py
Found 182 data points
'#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps'
'#2 Globally\nStarHub in Singapore\nDownload speed: 39 Mbps'
'#3 Globally\nSaskTel in Canada\nDownload speed: 35 Mbps'
'#4 Globally\nOrange in Israel\nDownload speed: 35 Mbps'
'#5 Globally\nolleh in South Korea\nDownload speed: 34 Mbps'
'#6 Globally\nVodafone in Romania\nDownload speed: 33 Mbps'
'#7 Globally\nVodafone in New Zealand\nDownload speed: 32 Mbps'
'#8 Globally\nTDC in Denmark\nDownload speed: 31 Mbps'
'#9 Globally\nT-Mobile in Hungary\nDownload speed: 30 Mbps'
'#10 Globally\nT-Mobile in Netherlands\nDownload speed: 30 Mbps'
'#11 Globally\nM1 in Singapore\nDownload speed: 29 Mbps'
'#12 Globally\nTelstra in Australia\nDownload speed: 29 Mbps'
'#13 Globally\nTelenor in Hungary\nDownload speed: 29 Mbps'
<...>
[{'company': 'SingTel',
  'country': 'Singapore',
  'rank': '1',
  'speed': '40 Mbps'},
 {'company': 'StarHub',
  'country': 'Singapore',
  'rank': '2',
  'speed': '39 Mbps'},
 {'company': 'SaskTel', 'country': 'Canada', 'rank': '3', 'speed': '35 Mbps'}
...
]
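
If you want numeric values rather than strings, the tooltip parsing can be pulled out into a small function that can be tested without a browser. This is only a sketch: parse_tooltip and TOOLTIP_RE are hypothetical names, and the tooltip layout is assumed from the sample output above, so the pattern may need adjusting if the site formats some tooltips differently.

```python
import re

# Assumed tooltip layout, taken from the sample output above:
#   "#<rank> Globally\n<company> in <country>\nDownload speed: <n> Mbps"
TOOLTIP_RE = re.compile(
    r"#(?P<rank>\d+) Globally\n"
    r"(?P<company>.+) in (?P<country>.+)\n"
    r"Download speed: (?P<speed>\d+(?:\.\d+)?) Mbps"
)


def parse_tooltip(text):
    """Return a dict for one tooltip, or None if the text does not match."""
    match = TOOLTIP_RE.fullmatch(text.strip())
    if match is None:
        return None
    return {
        "rank": int(match.group("rank")),
        "company": match.group("company"),
        "country": match.group("country"),
        "speed_mbps": float(match.group("speed")),
    }


print(parse_tooltip("#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps"))
```

Returning None for text that does not match lets the scraping loop skip or log an unexpected tooltip instead of raising in the middle of a run.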

It should be noted that the values you referenced in the question, in the circle elements, aren't particularly useful, as those just specify how to draw the dots within the SVG chart.
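
To see this for yourself, here is a minimal sketch that pulls the coordinates out of two of the circle elements quoted in the question. The names circle_positions and CIRCLE_RE are hypothetical, and the pattern assumes cx appears before cy in each tag, as it does in that snippet.

```python
import re

# Two of the <circle> elements quoted in the question.
SVG_SNIPPET = '''
<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;">
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;">
'''

# Assumes cx comes before cy inside each tag.
CIRCLE_RE = re.compile(r'<circle\b[^>]*\bcx="([^"]+)"[^>]*\bcy="([^"]+)"')


def circle_positions(markup):
    """Return (cx, cy) pixel coordinates for each <circle> tag in markup."""
    return [(float(cx), float(cy)) for cx, cy in CIRCLE_RE.findall(markup)]


positions = circle_positions(SVG_SNIPPET)
print(positions)
```

All this yields is where each dot is drawn, in pixels. The operator, country and speed are not in those attributes, which is why the tooltip has to be scraped.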

Levi Noecker answered Oct 20 '22 15:10