
Scraping difficult table

I have been trying to scrape a table from https://www.basketball-reference.com/leagues/NBA_2019.html for quite some time without success. The table I am trying to scrape is titled "Team Per Game Stats". I am confident that once I can scrape one element of that table, I can iterate through the columns I want and end up with a pandas data frame.

Here is my code so far:

from bs4 import BeautifulSoup
import requests

# URL that we are scraping
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
# Let's look at what the request content looks like
print(r.content)

# use BeautifulSoup on the content from the request
# (naming the parser explicitly avoids the "no parser specified" warning)
c = r.content
soup = BeautifulSoup(c, 'html.parser')
print(soup)

# prettify() indents the HTML so that its nesting is visible,
# which can make reading the HTML a little bit easier
print(soup.prettify())

# get the element with the id 'all_team-stats-per_game'
team_per_game = soup.find(id="all_team-stats-per_game")
print(team_per_game)

Any help would be greatly appreciated.

Aaron England asked Jan 20 '26 22:01


1 Answer

That webpage uses a trick to try to stop search engines and other automated web clients (including scrapers) from finding the table data. The tables are stored in HTML comments:

<div id="all_team-stats-per_game" class="table_wrapper setup_commented commented">

<div class="section_heading">
  <span class="section_anchor" id="team-stats-per_game_link" data-label="Team Per Game Stats"></span><h2>Team Per Game Stats</h2>    <div class="section_heading_text">
      <ul> <li><small>* Playoff teams</small></li>
      </ul>
    </div>      
</div>
<div class="placeholder"></div>
<!--
   <div class="table_outer_container">
      <div class="overthrow table_container" id="div_team-stats-per_game">
  <table class="sortable stats_table" id="team-stats-per_game" data-cols-to-freeze=2><caption>Team Per Game Stats Table</caption>

...

</table>

      </div>
   </div>
-->
</div>

Note that the opening div has the setup_commented and commented classes. JavaScript included in the page is executed by your browser; it loads the text from those comments and replaces the placeholder div with that content as new HTML for the browser to display.

You can extract the comment text like this:

from bs4 import BeautifulSoup, Comment

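# r is the requests response from the question's code
soup = BeautifulSoup(r.content, 'lxml')
# the commented-out table is a sibling that follows the placeholder div
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
comment = next(elem for elem in placeholder.next_siblings if isinstance(elem, Comment))
# parse the comment text as an HTML document in its own right
table_soup = BeautifulSoup(comment, 'lxml')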

Then continue by parsing the table HTML held in table_soup.
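
As a rough illustration of that last step, here is a minimal sketch that runs on a small inline sample rather than the live page. SAMPLE_HTML, the team names, the column set, and the helper commented_table_rows are all hypothetical stand-ins; the sketch also assumes the table has a single header row in thead and its data rows in tbody, which may not hold for every table on the site. It uses the built-in 'html.parser' so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, heavily trimmed stand-in for the real page markup.
SAMPLE_HTML = """
<div id="all_team-stats-per_game" class="table_wrapper setup_commented commented">
<div class="placeholder"></div>
<!--
<table id="team-stats-per_game">
<thead><tr><th>Team</th><th>G</th><th>PTS</th></tr></thead>
<tbody>
<tr><td>Team A</td><td>82</td><td>110.5</td></tr>
<tr><td>Team B</td><td>82</td><td>108.2</td></tr>
</tbody>
</table>
-->
</div>
"""


def commented_table_rows(html, wrapper_id, table_id):
    """Return the rows of a table hidden in an HTML comment as a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find(id=wrapper_id)
    # The table markup is a Comment node inside the wrapper, not an element.
    comment = wrapper.find(string=lambda text: isinstance(text, Comment))
    # Parse the comment text as HTML and pick out the table by its id.
    table = BeautifulSoup(comment, "html.parser").find("table", id=table_id)
    # Assumes one header row; cell values are kept as plain strings.
    headers = [th.get_text(strip=True) for th in table.thead.find_all("th")]
    rows = []
    for tr in table.tbody.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        rows.append(dict(zip(headers, cells)))
    return rows


rows = commented_table_rows(SAMPLE_HTML, "all_team-stats-per_game", "team-stats-per_game")
print(rows)
```

Passing the resulting list of dicts to pandas.DataFrame(rows) should give a data frame with one column per header; the values stay strings until you convert them.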

This specific site has published both terms of use and a page on data use, which you should probably read if you are going to use their data. Specifically, their terms state, under section 6, Site Content:

You may not frame, capture, harvest, or collect any part of the Site or Content without SRL's advance written consent.

Scraping the data would fall under that heading.

Martijn Pieters answered Jan 22 '26 11:01


