There is a website with a couple of interactive charts from which I would like to extract data. I've written a couple of web scrapers before in Python using Selenium WebDriver, but this seems to be a different problem. I've looked at a couple of similar questions on Stack Overflow. From those it seems that the solution could be to download the data directly from a JSON file. I looked at the source code of the website and identified a couple of JSON files, but upon inspection they don't seem to contain the data.
Does anyone know how to download the data from those graphs? In particular I am interested in this bar chart: .//*[@id='network_download']
Thanks
Edit: I should add that when I inspected the website using Firebug, I saw that it is possible to get data in the following format. This is not helpful, though, because it doesn't include any labels.
<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;">
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;">
<circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5" style="opacity: 0.983087;">
<circle fill="#B0B359" cx="613.88416" cy="21.42857142857143" r="4.5" style="opacity: 0.983087;">
<circle fill="#D1D49E" cx="602.917665" cy="26.785714285714285" r="4.5" style="opacity: 0.983087;">
<circle fill="#A5E0B5" cx="581.5437366666666" cy="32.142857142857146" r="4.5" style="opacity: 0.983087;">
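SVG charts like this tend to be a bit tough to scrape. The numbers you want aren't displayed until you actually hover over the individual elements with your mouse.
To get the data you need to find every dot in the chart, trigger its tooltip by clicking (or hovering over) it, and read the values from the tooltip text.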
This works for me:
from __future__ import print_function

from pprint import pprint as pp

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


def main():
    driver = webdriver.Chrome()
    ac = ActionChains(driver)
    try:
        driver.get("https://opensignal.com/reports/2016/02/state-of-lte-q4-2015/")
        dots_css = "div#network_download g g.dots_container circle"
        dots_list = driver.find_elements(By.CSS_SELECTOR, dots_css)
        print("Found {0} data points".format(len(dots_list)))
        download_speeds = list()
        for index, _ in enumerate(dots_list, 1):
            # Because this is an SVG chart, and because we need to hover it,
            # it is very likely that the elements will go stale as we do this.
            # For that reason we need to re-query each dot element right
            # before we click it
            single_dot_css = dots_css + ":nth-child({0})".format(index)
            dot = driver.find_element(By.CSS_SELECTOR, single_dot_css)
            dot.click()
            # Scrape the text from the popup
            popup_css = "div#network_download div.tooltip"
            popup_text = driver.find_element(By.CSS_SELECTOR, popup_css).text
            pp(popup_text)
            # The tooltip has three lines: rank, "company in country", speed
            rank, comp_and_country, speed = popup_text.split("\n")
            company, country = comp_and_country.split(" in ")
            speed_dict = {
                "rank": rank.split(" Globally")[0].strip("#"),
                "company": company,
                "country": country,
                "speed": speed.split("Download speed: ")[1]
            }
            download_speeds.append(speed_dict)
            # Hover away from the tool tip so it clears
            hover_elem = driver.find_element(By.ID, "network_download")
            ac.move_to_element(hover_elem).perform()
        pp(download_speeds)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
Sample Output:
(.venv35) ➜ stackoverflow python svg_charts.py
Found 182 data points
'#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps'
'#2 Globally\nStarHub in Singapore\nDownload speed: 39 Mbps'
'#3 Globally\nSaskTel in Canada\nDownload speed: 35 Mbps'
'#4 Globally\nOrange in Israel\nDownload speed: 35 Mbps'
'#5 Globally\nolleh in South Korea\nDownload speed: 34 Mbps'
'#6 Globally\nVodafone in Romania\nDownload speed: 33 Mbps'
'#7 Globally\nVodafone in New Zealand\nDownload speed: 32 Mbps'
'#8 Globally\nTDC in Denmark\nDownload speed: 31 Mbps'
'#9 Globally\nT-Mobile in Hungary\nDownload speed: 30 Mbps'
'#10 Globally\nT-Mobile in Netherlands\nDownload speed: 30 Mbps'
'#11 Globally\nM1 in Singapore\nDownload speed: 29 Mbps'
'#12 Globally\nTelstra in Australia\nDownload speed: 29 Mbps'
'#13 Globally\nTelenor in Hungary\nDownload speed: 29 Mbps'
<...>
[{'company': 'SingTel',
  'country': 'Singapore',
  'rank': '1',
  'speed': '40 Mbps'},
 {'company': 'StarHub',
  'country': 'Singapore',
  'rank': '2',
  'speed': '39 Mbps'},
 {'company': 'SaskTel', 'country': 'Canada', 'rank': '3', 'speed': '35 Mbps'}
 ...
]
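The string splitting in the script assumes every tooltip has exactly three lines and that " in " occurs only once in the middle line. If you want something a little more defensive, here is a rough sketch of a regex-based parser. The names `parse_tooltip` and `TOOLTIP_RE` are my own, and the pattern is inferred only from the sample output above, so it may need adjusting for other charts on the page. It uses `re.fullmatch`, so it needs Python 3.4 or later.

```python
import re

# Hypothetical pattern, inferred from the sample tooltips only.
# The company group is greedy, so a name containing " in " is split
# on the LAST " in " of the line.
TOOLTIP_RE = re.compile(
    r"#(?P<rank>\d+) Globally\n"
    r"(?P<company>.+) in (?P<country>.+)\n"
    r"Download speed: (?P<speed>\d+(?:\.\d+)?) Mbps"
)


def parse_tooltip(text):
    """Return a dict for one tooltip, or None if the text does not match."""
    match = TOOLTIP_RE.fullmatch(text.strip())
    if match is None:
        return None
    return {
        "rank": int(match.group("rank")),
        "company": match.group("company"),
        "country": match.group("country"),
        "speed_mbps": float(match.group("speed")),
    }


if __name__ == "__main__":
    sample = "#5 Globally\nolleh in South Korea\nDownload speed: 34 Mbps"
    print(parse_tooltip(sample))
```

Returning None instead of raising lets the scraping loop skip or log a tooltip that was read while empty or half rendered, rather than crashing partway through the dots.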
It should be noted that the values you referenced in the question, in the circle elements, aren't particularly useful, as those just specify how to draw the dots within the SVG chart.
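To see why, you can parse a few of the circle elements from the question. This is only a sketch: the function names are mine, the circles are self-closed so they parse as XML, and I'm assuming the first circle sits on row 1. In the quoted markup `cy` grows by a constant step of about 5.357, so it only tells you which row a dot is drawn on. `cx` is presumably the speed mapped through the chart's horizontal scale, but that scale isn't in the markup, so you can't turn it back into Mbps from these attributes alone.

```python
import xml.etree.ElementTree as ET

# Three of the <circle> elements from the question, wrapped in a <g>.
# On the live page the elements would be in the SVG namespace, so the
# tag lookup below would need adjusting there.
SVG_SNIPPET = """<g>
<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5"/>
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5"/>
<circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5"/>
</g>"""


def circle_positions(svg_text):
    """Return a list of (cx, cy) pixel coordinates, in document order."""
    root = ET.fromstring(svg_text)
    return [(float(c.get("cx")), float(c.get("cy"))) for c in root.iter("circle")]


def row_indices(positions):
    """Recover the 1-based row of each dot, assuming the first dot is row 1."""
    step = positions[0][1]
    return [round(cy / step) for _, cy in positions]


if __name__ == "__main__":
    positions = circle_positions(SVG_SNIPPET)
    print(row_indices(positions))
```

All you get back is the drawing order of the dots, which is why the tooltip text is the practical source for operator names and speeds.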