
Trouble parsing tabular items from a graph located in a website

I'm trying to extract the tabular content shown on a graph in a webpage. The content of those tables is only visible when someone hovers the cursor over the graph area. One such table is this one.

Webpage address

The graph that contains the tables is titled EPS consensus revisions : last 18 months.

I've tried so far with:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://www.marketscreener.com/SUNCORP-GROUP-LTD-6491453/revisions/"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)
for items in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#graphRevisionBNAeec span > table tr"))):
    data = [item.text for item in items.find_elements_by_css_selector("td")]
    print(data)
driver.quit()

When I run the above script, it throws this error, pointing at the for items in wait.until() line: raise TimeoutException(message, screen, stacktrace) selenium.common.exceptions.TimeoutException: Message:

Output from a single table out of many should look like:

Period: Thursday, Aug 22, 2019
Number of upgrading estimates: 0
Number of unchanged estimates: 7
Number of Downgrading estimates: 0
High Value: 0.90 AUD
Mean Value: 0.85 AUD
Low Value: 0.77 AUD

How can I get the content of those tables from that graph?

EDIT: I'm still hoping for a solution based purely on a browser simulator.

asked Aug 26 '19 08:08 by MITHU


3 Answers

You'll be much better off querying the website's backend directly than using selenium to scrape the frontend for three important reasons:

  1. Speed: Using the API directly is much faster and more efficient, because it fetches only the data you need, doesn't have to wait for JavaScript to run or pixels to render, and avoids the overhead of running a webdriver.

  2. Stability: Changes to the frontend are usually more frequent and harder to track than changes to the backend. If your code relies on the site's frontend, it will probably stop working soon after the site makes UI changes.

  3. Accuracy: Sometimes the data displayed in the UI is inaccurate or incomplete. For example, this website rounds all numbers to two decimal places in the UI, while the backend sometimes returns more digits (0.84678 rather than 0.85).

Here's how you could easily use the backend API:

import requests

# API URL found using Chrome DevTools
url = 'https://www.marketscreener.com/charting/afDataFeed.php?codeZB=6491453&t=eec&sub_t=bna&iLang=2'
# Send a Chrome-like User-Agent, because the API apparently blocks the default python-requests one
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
# Make a request to the API, parse the JSON response and keep its first element
data = requests.get(url, headers=headers).json()[0]


# Collect one value per series for a specific date
def get_vals(date):
    vals = []
    for items in data:
        for item in items:
            if item['t'] == date:
                vals.append(item['y'])
                break
    return vals


# Use the function above with the example table given in the question
print(get_vals('Thursday, Aug 22, 2019'))

Running this outputs the list [0.9, 0.84678, 0.76628, 0, 7, 0]. These are the values from the example table in the question: the three EPS values (0.90, 0.85 and 0.77 after rounding) followed by the three estimate counts.
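Since the question asks for one labelled table per date, the same response can be regrouped by date instead of being queried one date at a time. The following is only a sketch: `tables_by_date`, `format_table` and `LABELS` are hypothetical names, `sample` is a stand-in for `data` built from the values shown above, and the label order is an assumption inferred from that single example, so check it against the live response.

```python
from collections import defaultdict

# Assumed order of the series in the response, inferred from the one example
# above (the two zero counts make the upgrading/downgrading order ambiguous).
LABELS = [
    "High Value",
    "Mean Value",
    "Low Value",
    "Number of upgrading estimates",
    "Number of unchanged estimates",
    "Number of downgrading estimates",
]


def tables_by_date(series_list):
    """Group every point's 'y' value under its 't' date, keeping series order."""
    tables = defaultdict(list)
    for series in series_list:
        for point in series:
            tables[point["t"]].append(point["y"])
    return dict(tables)


def format_table(date, values, labels=LABELS):
    """Render one date's values as 'label: value' lines."""
    lines = [f"Period: {date}"]
    lines += [f"{label}: {value}" for label, value in zip(labels, values)]
    return "\n".join(lines)


# Stand-in for `data`: one list of {'t': ..., 'y': ...} points per series.
sample = [
    [{"t": "Thursday, Aug 22, 2019", "y": 0.9}],
    [{"t": "Thursday, Aug 22, 2019", "y": 0.84678}],
    [{"t": "Thursday, Aug 22, 2019", "y": 0.76628}],
    [{"t": "Thursday, Aug 22, 2019", "y": 0}],
    [{"t": "Thursday, Aug 22, 2019", "y": 7}],
    [{"t": "Thursday, Aug 22, 2019", "y": 0}],
]

for date, values in tables_by_date(sample).items():
    print(format_table(date, values))
```

Unlike `get_vals`, this sketch appends every point it finds, so a series that repeats a date or skips one would shift the values relative to the labels.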

answered Sep 18 '22 00:09 by kmaork


Try changing this locator:

By.CSS_SELECTOR, "#graphRevisionBNAeec span > table tr"

to this:

By.XPATH, "//*[@class='tabElemNoBor overfH']"

With that change, the console prints something like this:

[u'EPS consensus revisions : last 18 months', u'EPS consensus revisions : last 18 months', u'Number of Estimates\nEPS 2020(AUD)\nNumber of upgrading estimates\nHigh Value\nNumber of unchanged estimates\nMean Value\nNumber of downgrading estimates\nLow Value\nMar 18\nApr 18\nMay 18\nJun 18\nJul 18\nAug 18\nSep 18\nOct 18\nNov 18\nDec 18\nJan 19\nFeb 19\nMar 19\nApr 19\nMay 19\nJun 19\nJul 19\nAug 19\nSep 19\nOct 19\n0\n2\n4\n6\n8\n10\n12\n0.2\n0.4\n0.6\n0.8\n1\n1.2\n1.4\n\xa9marketscreener.com - S&P Global Market Intelligence']
answered Sep 18 '22 00:09 by frianH


This is a solution using Selenium (I tested my code with Firefox, but it works fine with Chrome too):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Firefox()
actions = ActionChains(driver)

driver.get("https://www.marketscreener.com/SUNCORP-GROUP-LTD-6491453/revisions/")

# Locate the table that wraps the chart; change the XPath if you want another table
table = driver.find_element(By.XPATH, "//table[@class = 'tabElemNoBor overfH']")
# Hover over the chart so that the tooltip's HTML is generated
actions.move_to_element(table).perform()

# XPath of the tooltip that appears inside the chart
tooltip = "//table[@class = 'tabElemNoBor overfH']//div[@class = 'highcharts-label highcharts-tooltip highcharts-color-undefined']"
date = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.XPATH, tooltip + "/span/span//b"))).text
data = WebDriverWait(driver, 60).until(EC.presence_of_all_elements_located((By.XPATH, tooltip + "//td")))
data = [item.get_attribute("innerHTML") for item in data]
# Even-indexed cells hold the labels, odd-indexed cells hold the values
data_1 = data[0::2]
# Skip the first three characters of each value cell and cut at the first "&"
data_2 = [cell[3:cell.find("&")] for cell in data[1::2]]
data = list(zip(data_1, data_2))
print(date)
for label, value in data:
    print(label, value)
driver.quit()

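The script simply hovers over the chart so that the page generates the HTML of the tooltip table, then reads that table. To get the table for a different date, move the mouse to another point on the chart (for example with ActionChains.move_to_element_with_offset).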
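The slicing step in the script can be tried without a browser. This is a rough sketch of the same pairing logic as a standalone function; `pair_tooltip_cells` is a hypothetical name, and the `cells` strings are made up to fit the `[3:find("&")]` slicing (a leading three-character tag and an HTML entity after the number), not copied from the real page.

```python
def pair_tooltip_cells(cells):
    """Pair alternating label/value cells taken from the tooltip's <td> innerHTML."""
    labels = cells[0::2]
    values = []
    for raw in cells[1::2]:
        end = raw.find("&")
        # Skip the first three characters and cut at the first "&";
        # if there is no "&", keep everything after those three characters.
        values.append(raw[3:end] if end != -1 else raw[3:])
    return list(zip(labels, values))


# Made-up innerHTML strings, shaped only to illustrate the slicing.
cells = [
    "Number of unchanged estimates",
    "<b>7&nbsp;</b>",
    "Mean Value",
    "<b>0.85&nbsp;AUD</b>",
]

for label, value in pair_tooltip_cells(cells):
    print(label, value)
```

Like the original slicing, this drops everything after the first "&", so a unit such as AUD is lost. It differs only when a value cell contains no "&" at all, where the original would cut off the last character.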

answered Sep 22 '22 00:09 by brfh