Scraping difficult table

Question

I have been trying to scrape a table from basketball-reference.com (the URL is in the code below) for quite some time, without success. The table I am trying to scrape is titled "Team Per Game Stats". I am confident that once I can scrape one element of that table, I can iterate through the columns I want and eventually end up with a pandas data frame.

Here is my code so far:

from bs4 import BeautifulSoup
import requests

# url that we are scraping
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
# Let's look at what the request content looks like
print(r.content)

# use Beautifulsoup on content from request
c = r.content
soup = BeautifulSoup(c, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
print(soup)

# prettify() indents the HTML to reflect its nesting,
# which can make it a little bit easier to read
print(soup.prettify())

# get the wrapper element whose id is 'all_team-stats-per_game'
team_per_game = soup.find(id="all_team-stats-per_game")
print(team_per_game)

Any help would be greatly appreciated.

Martijn Pieters · Accepted Answer

That webpage employs a trick to try to stop search engines and other automated web clients (including scrapers) from finding the table data. The tables are stored inside HTML comments:

<div id="all_team-stats-per_game" class="table_wrapper setup_commented commented">

<div class="section_heading">
  <span class="section_anchor" id="team-stats-per_game_link" data-label="Team Per Game Stats"></span><h2>Team Per Game Stats</h2>    <div class="section_heading_text">
      <ul> <li><small>* Playoff teams</small></li>
      </ul>
    </div>      
</div>
<div class="placeholder"></div>
<!--
   <div class="table_outer_container">
      <div class="overthrow table_container" id="div_team-stats-per_game">
  <table class="sortable stats_table" id="team-stats-per_game" data-cols-to-freeze=2><caption>Team Per Game Stats Table</caption>

...

</table>

      </div>
   </div>
-->
</div>

Note that the opening div has the setup_commented and commented classes. JavaScript code included in the page is executed by your browser; it loads the text from those comments and replaces the placeholder div with that content as new HTML for the browser to display.
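A quick way to confirm this from Python, as a minimal sketch: the HTML below is a made-up, heavily shortened stand-in for the real page, and html.parser is used only so the snippet needs nothing beyond bs4. An ordinary tag search does not see the table, while a search for Comment nodes does.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, shortened stand-in for the real page markup.
sample_html = """
<div id="all_team-stats-per_game" class="table_wrapper setup_commented commented">
<div class="placeholder"></div>
<!--
<table id="team-stats-per_game"><tr><td>hidden</td></tr></table>
-->
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# The table sits inside a comment, so it is not part of the element tree.
table_found = soup.find("table") is not None

# Comments are string nodes of type Comment; keep the ones that contain a table.
all_comments = soup.find_all(string=lambda s: isinstance(s, Comment))
hidden = [c for c in all_comments if "<table" in c]

print(table_found)  # False
print(len(hidden))  # 1
```

This is why soup.find(id="all_team-stats-per_game") in the question returns the wrapper div but no table rows inside it.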

You can extract the comment text like this:

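from bs4 import BeautifulSoup, Comment

# r is the requests response from the question's code
soup = BeautifulSoup(r.content, 'lxml')
# the empty placeholder div sits directly before the comment holding the table
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
# take the first Comment node among the placeholder's following siblings
comment = next(elem for elem in placeholder.next_siblings if isinstance(elem, Comment))
# the comment text is itself HTML, so parse it as a separate document
table_soup = BeautifulSoup(comment, 'lxml')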

Then continue to parse the table HTML.
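As a rough sketch of that last step, the snippet below turns a commented-out table into a list of row dictionaries. The sample HTML, the column names and the numbers are invented for illustration, and the helper name table_records is hypothetical; the real table has more columns and its markup may differ (for example, it may have more than one header row). html.parser is used here only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature of the page; real column names and values will differ.
sample_html = """
<div id="all_team-stats-per_game">
<div class="placeholder"></div>
<!--
<table id="team-stats-per_game">
<thead><tr><th>Rk</th><th>Team</th><th>PTS</th></tr></thead>
<tbody>
<tr><th>1</th><td>Team A</td><td>110.5</td></tr>
<tr><th>2</th><td>Team B</td><td>108.2</td></tr>
</tbody>
</table>
-->
</div>
"""


def table_records(html, wrapper_id, table_id):
    """Return the commented-out table as a list of {header: cell text} dicts."""
    soup = BeautifulSoup(html, "html.parser")
    placeholder = soup.select_one(f"#{wrapper_id} .placeholder")
    # First Comment node after the placeholder holds the table markup.
    comment = next(e for e in placeholder.next_siblings if isinstance(e, Comment))
    table = BeautifulSoup(comment, "html.parser").find("table", id=table_id)
    headers = [th.get_text(strip=True) for th in table.thead.find_all("th")]
    records = []
    for tr in table.tbody.find_all("tr"):
        # Data rows may mix <th> and <td> cells, so collect both in order.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        records.append(dict(zip(headers, cells)))
    return records


records = table_records(sample_html, "all_team-stats-per_game", "team-stats-per_game")
print(records[0])
```

Passing records to pandas.DataFrame(records) should then give a data frame with one column per header. The values arrive as strings, so numeric columns would still need converting, for example with pandas.to_numeric.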

This specific site has published both terms of use and a page on data use, which you should probably read if you are going to use its data. Specifically, the terms state, under section 6, Site Content:

You may not frame, capture, harvest, or collect any part of the Site or Content without SRL's advance written consent.

Scraping the data would fall under that heading.
