I want to scrape data from this page (and pages similar to it): https://cereals.ahdb.org.uk/market-data-centre/historical-data/feed-ingredients.aspx
This page uses Power BI. Unfortunately, finding a way to scrape data out of Power BI is hard, because everyone wants to scrape data using/into Power BI, not from it. The closest answer I found was this question, yet it turned out to be unrelated.
First, I tried Apache Tika, but I soon realized the table data is loaded after the page itself, so I need the rendered version of the page.
Therefore, I switched to Selenium. I wanted to start with Select All (sending the Ctrl+A key combination), but it doesn't work. Maybe it is blocked by the page's event handlers; I also tried removing all the events using the developer tools, yet Ctrl+A still doesn't work.
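For reference, this is roughly how the key combination was sent, as a minimal sketch that assumes a driver already pointing at the report; on this page it still selects nothing:
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

# Send Ctrl+A to the page; on this Power BI report it has no visible effect.
ActionChains(driver).key_down(Keys.CONTROL).send_keys('a').key_up(Keys.CONTROL).perform()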
I also tried to read the HTML contents, but Power BI positions its div elements with position:absolute, and working out which row and column a given div belongs to takes real effort.
Since Power BI transfers its data as JSON, I tried to read it from there. However, it is so convoluted that I couldn't work out the rules; it seems to store keywords in one place and reference them by index in the table.
Note: I realized that not all of the data is loaded (or even rendered) at once. A div of class scroll-bar-part-bar acts as a scroll bar, and dragging it loads/shows the remaining parts of the data.
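For what it's worth, dragging that scroll bar can be automated with ActionChains; this is only a sketch, and the selector and the pixel offset are assumptions that would need tuning for the actual report:
from selenium.webdriver.common.action_chains import ActionChains

# Assumption: the table's vertical scroll bar is a div.scroll-bar-part-bar inside the visual container.
bar = driver.find_element_by_css_selector('div.scroll-bar-part-bar')
# Drag the bar down by a small offset to make Power BI render the next chunk of rows,
# then re-read the cells; repeat until no new rows appear.
ActionChains(driver).click_and_hold(bar).move_by_offset(0, 20).release().perform()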
The code I used to read the data is as follows. As mentioned, the order of the data it produces differs from what is rendered in the browser:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
driver = webdriver.Chrome(options=options, executable_path="C:/Drivers/chromedriver.exe")
driver.get("https://app.powerbi.com/view?r=eyJrIjoiYjVjM2MyNjItZDE1Mi00OWI1LWE5YWYtODY4M2FhYjU4ZDU1IiwidCI6ImExMmNlNTRiLTNkM2QtNDM0Ni05NWVmLWZmMTNjYTVkZDQ3ZCJ9")

# The visual container that holds the table, located by its absolute XPath.
parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
# Every descendant element; the cell text sits in each element's title attribute.
children = parent.find_elements_by_xpath('.//*')
values = [child.get_attribute('title') for child in children]
I would appreciate a solution to any of the above problems, though the most interesting one to me is the convention Power BI uses to store its data in JSON.
Leaving the scrolling and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (which the code in the question already does):
parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements_by_xpath('.//*')
Then sort them by their on-screen location (this requires numpy):
import numpy as np

x = [child.location['x'] for child in children]
y = [child.location['y'] for child in children]
# Sort primarily by y (row), then by x (column).
index = np.lexsort((x, y))
To group what has been read into rows, this code may help:
rows = []
row = []
last_line = y[index[0]]
for i in index:
    if last_line == y[i]:
        # Same y coordinate: still on the same row.
        row.append(children[i].get_attribute('title'))
    else:
        # New y coordinate: close the current row and start a new one.
        rows.append(row)
        row = [children[i].get_attribute('title')]
        last_line = y[i]
rows.append(row)
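If it helps, the result can then be loaded into a pandas DataFrame; this assumes the first row actually holds the column headers, which depends on the visual:
import pandas as pd

# Assumption: rows[0] contains the header cells of the table visual.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())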
A few more details about exactly which data you are trying to scrape would have helped to construct a canonical answer. However, to scrape the Commodity and Basis data using Selenium, since the desired elements are within an <iframe>, you have to:
Induce WebDriverWait for the desired frame_to_be_available_and_switch_to_it().
Induce WebDriverWait for the desired visibility_of_element_located() for the table.
Induce WebDriverWait for the desired visibility_of_all_elements_located() for the desired data.
You can use the following Locator Strategies:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://ahdb.org.uk/cereals-oilseeds/feed-ingredient-prices")
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME,"iframe")))
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.innerContainer")))
print("Commodity:")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='pivotTableCellWrap cell-interactive tablixAlignLeft ' and starts-with(@title, 'Ex-')]//parent::div//preceding::div[1]")))])
print("-=-=-=-=-=-")
print("Basis:")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.pivotTableCellWrap.cell-interactive.tablixAlignLeft[title^='Ex-']")))])
Console Output:
Commodity:
['Argentine Sunflowermeal 32/33%', 'Maize Gluten Feed', 'Pelleted Wheat Feed', 'Rapemeal (34%)', 'Soyameal (Hi Pro)', 'Soyameal, Brazilian (48%)']
-=-=-=-=-=-
Basis:
['Ex-Store Liverpool', 'Ex-Store Liverpool', 'Ex-Mill Midlands and Southern Mills', 'Ex-Mill Erith', 'Ex-Store East Coast', 'Ex-Store Liverpool']
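If you need the two columns paired up, the two lists can simply be zipped, assuming you first capture them into variables (hypothetical names below) instead of printing them directly:
# commodity_list and basis_list are the two lists printed above, captured into variables.
pairs = list(zip(commodity_list, basis_list))
# e.g. [('Argentine Sunflowermeal 32/33%', 'Ex-Store Liverpool'), ...]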
As per your comment, as well as the link given in the bounty explanation, to scrape the data from Page 2 within the table under the heading Scouting Location using Selenium, you can use the following solution. For the sake of demonstration I have limited the output to the first 20 country entries; you can expand it as much as you wish:
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://app.powerbi.com/view?r=eyJrIjoiMzE1ODNmYzQtMWZhYS00NTNjLTg1MDUtOTQ2MGMyNDVkZTY3IiwidCI6IjE2M2FjNDY4LWFiYjgtNDRkMC04MWZkLWQ5ZGIxNWUzYWY5NiIsImMiOjh9")
# Click the "Next Page" arrow to move to Page 2 of the report.
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='navigation-wrapper navigation-wrapper-big']//i[@title='Next Page']"))).click()
print("Country:")
# Read the body cells of the table and keep only the first 20 entries.
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='bodyCells']//div[@class='pivotTableCellWrap cell-interactive ']")))[:20]])
driver.quit()
Console Output:
DevTools listening on ws://127.0.0.1:49438/devtools/browser/1b5a2590-5a90-47fd-93c7-cfcf58a6c241
Country:
['Myanmar', 'Myanmar', 'Mozambique', 'Malawi', 'Malawi', 'Mozambique', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Myanmar', 'Myanmar', 'Myanmar']