
How to scrape a website with table content that is retrieved by JavaScript?

I want to scrape a table from a website. The table looks like this:

<table class="table table-hover data-table sort display">
        <thead>
          <tr>
            <th class="Column1">
            </th>
            <th class="Column2">
            </th>
          </tr>
        </thead>
        <tbody>
          <tr ng-repeat="item in filteredList | orderBy:columnToOrder:reverse">
            <td>{{item.Col1}}</td>
            <td>{{item.Col2}}</td>
          </tr>
        </tbody>
</table>

It seems that this website is built with a JavaScript framework that retrieves the table content from the backend through web services.

The problem is: how can we scrape the table data when the cells hold template placeholders instead of the actual values? The code above has the content enclosed in {{ }}. Does this make the website unscrapable? Is there a solution? Thank you.

I am using Python and beautifulsoup4.

guagay_wk asked Feb 13 '23 05:02



1 Answer

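When the content is generated by JavaScript, BeautifulSoup on its own is usually not the right tool, because it only parses the HTML it is given and does not run any scripts. I use Selenium to render the page in a real browser first. Try this and see whether the HTML you get back is scrapable: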

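import time

from selenium import webdriver

url = "https://example.com"  # placeholder: replace with the page you want to scrape

driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
# scroll to the bottom so that lazily loaded content is requested
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait for the JavaScript to fill in the table

# now print the rendered HTML
print(driver.page_source)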

At this point, you can use BeautifulSoup to scrape the data out of driver.page_source. Note: you will need to install selenium and Firefox.
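As a rough sketch of that last step, the rendered HTML can be handed to BeautifulSoup like any other page. The function name extract_rows and the sample markup below are made up for illustration; the sample stands in for driver.page_source and only reuses the table classes from the question. It assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup


def extract_rows(html):
    """Return the text of each table row's <td> cells, one list per row."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any single class in the element's class attribute
    table = soup.find("table", class_="data-table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows contain only <th>, so they are skipped
            rows.append(cells)
    return rows


# Hypothetical rendered markup standing in for driver.page_source.
sample_html = """
<table class="table table-hover data-table sort display">
  <thead>
    <tr><th class="Column1"></th><th class="Column2"></th></tr>
  </thead>
  <tbody>
    <tr><td>alpha</td><td>1</td></tr>
    <tr><td>beta</td><td>2</td></tr>
  </tbody>
</table>
"""

print(extract_rows(sample_html))
```

In the real script you would call extract_rows(driver.page_source) instead. If the cells still come back as {{item.Col1}}, the framework had not finished rendering when the HTML was read, so a longer wait is probably needed.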

PepperoniPizza answered Apr 09 '23 16:04
