Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scraping Data from a website which uses Power BI - retrieving data from Power BI on a website

I want to scrape data from this page (and pages similar to it): https://cereals.ahdb.org.uk/market-data-centre/historical-data/feed-ingredients.aspx

This page uses Power BI. Unfortunately, finding a way to scrape Power BI is hard, because everyone wants to scrape using/into Power BI, not from it. The closest answer was this question. Yet unrelated.

Firstly, I used Apache tika, and soon I realized the table data is been loading after loading the page. I need the rendered version of the page.

Therefore, I used Selenium. I wanted to Select All at the begining (sending Ctrl+A key combination), but it doesn't work. Maybe it is restricted by the page events (I also tried to remove all the events using developer tools, yet still Ctrl+A doesn't work.

I also tried to read the HTML contents, but Power BI puts div elements on the screen using position:absolute and distinguishing the location of a div in the table (both row and column) is an effortful activity.

Since Power BI uses JSON, I tried to read data from there. However it is so complicated I couldn't find out the rules. It seems it puts keywords somewhere and uses their indices in the table.

Note: I realized that all of the data is not loaded and even shown at the same time. A div of class scroll-bar-part-bar is responsible to act as a scroll bar, and moving that loads/shows other parts of the data.

The code I used to read data is as follows. As mentioned, the order of the produced data differs from what is rendered on the browser:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
driver = webdriver.Chrome(options=options, executable_path="C:/Drivers/chromedriver.exe")

driver.get("https://app.powerbi.com/view?r=eyJrIjoiYjVjM2MyNjItZDE1Mi00OWI1LWE5YWYtODY4M2FhYjU4ZDU1IiwidCI6ImExMmNlNTRiLTNkM2QtNDM0Ni05NWVmLWZmMTNjYTVkZDQ3ZCJ9")
parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements_by_xpath('.//*')
values = [child.get_attribute('title') for child in children]

I appreciate solutions for any of the above problems. The most interesting for me though, is the convention of storing Power BI data in JSON format.

like image 334
am.rez Avatar asked Mar 08 '19 12:03

am.rez


People also ask

How do you transfer data from a website to Power BI?

In the Power BI Desktop Home ribbon tab, drop down the arrow next to Get Data, and then select Web. You can also select the Get Data item itself, or select Get Data from the Power BI Desktop get started dialog, then select Web from the All or Other section of the Get Data dialog, and then select Connect.

How do I extract data from Microsoft Power BI?

Export data from a Power BI dashboardOpen a dashboard in the Power BI service and select a tile with a visual. From the upper right corner of the tile, open the More options (...) dropdown and select Export to . csv.

Can you use a Power BI report as a data source?

Power BI reports can connect to a number of data sources. Depending on how data is used, different data sources are available. Data can be imported or data can be queried directly using DirectQuery, or a live connection to SQL Server Analysis Services.

Can you do data mining with Power BI?

For data mining, the strength of Microsoft Power BI is the ability to easily capture and assemble data from diverse sources, particularly other Microsoft tools and platforms. The BI solution offers dashboards that deliver interactive panels.


Video Answer


2 Answers

Putting the scroll part and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (which is done in the question):

parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements_by_xpath('.//*')

Then sort them using their location:

x = [child.location['x'] for child in children]
y = [child.location['y'] for child in children]
index = np.lexsort((x,y))

To sort what we have read in different lines, this code may help:

rows = []
row = []
last_line = y[index[0]]
for i in index:
    if last_line != y[i]:
        row.append[children[i].get_attribute('title')]
    else:
        rows.append(row)
        row = list([children[i].get_attribute('title')]
rows.append(row)
like image 147
am.rez Avatar answered Oct 17 '22 03:10

am.rez


A few more details about exactly which data you are trying to scrape would have helped to construct a canonical answer. However, to scrape the data within the Commodity and Basis using Selenium, as the the desired element is within an <iframe> so you have to:

  • Induce WebDriverWait for the desired frame_to_be_available_and_switch_to_it().

  • Induce WebDriverWait for the desired visibility_of_element_located() for the table.

  • Induce WebDriverWait for the desired visibility_of_all_elements_located() for the desired data.

  • You can use the following Locator Strategies:

  • Code Block:

         from selenium import webdriver
         from selenium.webdriver.common.by import By
         from selenium.webdriver.support.ui import WebDriverWait
         from selenium.webdriver.support import expected_conditions as EC
    
         options = webdriver.ChromeOptions() 
         options.add_argument("start-maximized")
         options.add_experimental_option("excludeSwitches", ["enable-automation"])
         options.add_experimental_option('useAutomationExtension', False)
         driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
         driver.get("https://ahdb.org.uk/cereals-oilseeds/feed-ingredient-prices")
         WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME,"iframe")))
         WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.innerContainer")))
         print("Commodity:")
         print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='pivotTableCellWrap cell-interactive tablixAlignLeft ' and starts-with(@title, 'Ex-')]//parent::div//preceding::div[1]")))])
         print("-=-=-=-=-=-")
         print("Basis:")
         print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.pivotTableCellWrap.cell-interactive.tablixAlignLeft[title^='Ex-']")))])
    
  • Console Output:

         Commodity:
         ['Argentine Sunflowermeal 32/33%', 'Maize Gluten Feed', 'Pelleted Wheat Feed', 'Rapemeal (34%)', 'Soyameal (Hi Pro)', 'Soyameal, Brazilian (48%)']
         -=-=-=-=-=-
         Basis:
         ['Ex-Store Liverpool', 'Ex-Store Liverpool', 'Ex-Mill Midlands and Southern Mills', 'Ex-Mill Erith', 'Ex-Store East Coast', 'Ex-Store Liverpool']
    

Update (as per bounty explanation)

As per your comment as well as the given link on the bounty explanation, to scrape the data from Page 2 within the table under the heading Scouting Location using Selenium, you can use the following solution. For the sake of demonstration I have created a List of first 20 countries and you can expand as much as you wish:

  • Code Block:

      from selenium import webdriver
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
    
      options = webdriver.ChromeOptions() 
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
      driver.get("https://app.powerbi.com/view?r=eyJrIjoiMzE1ODNmYzQtMWZhYS00NTNjLTg1MDUtOTQ2MGMyNDVkZTY3IiwidCI6IjE2M2FjNDY4LWFiYjgtNDRkMC04MWZkLWQ5ZGIxNWUzYWY5NiIsImMiOjh9")
      WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='navigation-wrapper navigation-wrapper-big']//i[@title='Next Page']"))).click()
      print("Country:")
      print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='bodyCells']//div[@class='pivotTableCellWrap cell-interactive ']")))[:20]])
      driver.quit()
    
  • Console Output:

      DevTools listening on ws://127.0.0.1:49438/devtools/browser/1b5a2590-5a90-47fd-93c7-cfcf58a6c241
      Country:
      ['Myanmar', 'Myanmar', 'Mozambique', 'Malawi', 'Malawi', 'Mozambique', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Myanmar', 'Myanmar', 'Myanmar']
    
  • Console Output Snapshot:

country

like image 36
undetected Selenium Avatar answered Oct 17 '22 04:10

undetected Selenium