BeautifulSoup returns None even though the element exists

Question

I have gone through most of the solutions to similar questions but haven't found one that works. More importantly, I haven't found an explanation of why this happens, other than cases where JavaScript or something similar modifies the page being scraped.

I am trying to scrape the table for game "Officials" from the site: http://www.pro-football-reference.com/boxscores/201309050den.htm

My code is:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
html = urlopen(url)
bsObj = BeautifulSoup(html, "lxml")
officials = bsObj.findAll("table", {"id": "officials"})

for entry in officials:
    print(str(entry))

I am just printing to the console for now, but I get an empty list with findAll or None with find. I have also tried this with the basic html.parser with no luck.

Can someone with a better understanding of HTML explain what is different about this webpage specifically? Thanks in advance!

thebadguy · Accepted Answer

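Try this code, which loads the page in a real browser through Selenium so that the page's JavaScript runs before the HTML is parsed: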

from selenium import webdriver
import time
from bs4 import BeautifulSoup


driver = webdriver.Chrome()
url= "http://www.pro-football-reference.com/boxscores/201309050den.htm"
driver.maximize_window()
driver.get(url)

time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
officials = soup.findAll("table",{"id":"officials"})

for entry in officials:
    print(str(entry))


driver.quit()

It will print:

<table class="suppress_all sortable stats_table now_sortable" data-cols-to-freeze="0" id="officials"><thead><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr></thead><caption>Officials Table</caption><tbody>
<tr data-row="0"><th class=" " data-stat="ref_pos" scope="row">Referee</th><td class=" " data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr data-row="1"><th class=" " data-stat="ref_pos" scope="row">Umpire</th><td class=" " data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
<tr data-row="2"><th class=" " data-stat="ref_pos" scope="row">Head Linesman</th><td class=" " data-stat="name"><a href="/officials/BergJe1r.htm">Jerry Bergman</a></td></tr>
<tr data-row="3"><th class=" " data-stat="ref_pos" scope="row">Field Judge</th><td class=" " data-stat="name"><a href="/officials/GautGr0r.htm">Greg Gautreaux</a></td></tr>
<tr data-row="4"><th class=" " data-stat="ref_pos" scope="row">Back Judge</th><td class=" " data-stat="name"><a href="/officials/YettGr0r.htm">Greg Yette</a></td></tr>
<tr data-row="5"><th class=" " data-stat="ref_pos" scope="row">Side Judge</th><td class=" " data-stat="name"><a href="/officials/PattRi0r.htm">Rick Patterson</a></td></tr>
<tr data-row="6"><th class=" " data-stat="ref_pos" scope="row">Line Judge</th><td class=" " data-stat="name"><a href="/officials/BaynRu0r.htm">Rusty Baynes</a></td></tr>
</tbody></table>

Padraic Cunningham · Answer

The table is in the source; it is just commented out. It is trivial to remove the comment markers using a regex:

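from bs4 import BeautifulSoup
import requests
import re

url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
# Use .text (str) rather than .content (bytes): on Python 3 a str regex
# pattern cannot be applied to bytes.
html = requests.get(url).text
# Strip the comment markers so the commented-out markup is parsed as tags.
bsObj = BeautifulSoup(re.sub("<!--|-->", "", html), "lxml")
officials = bsObj.find_all("table", {"id": "officials"})

for entry in officials:
    print(entry)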

There is only one such table, so you need neither find_all nor the loop; just use find:

In [1]: from bs4 import BeautifulSoup
   ...: import requests
   ...: import re
   ...: url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
   ...: 
   ...: html = requests.get(url).text
   ...: bsObj = BeautifulSoup(re.sub("<!--|-->", "", html), "lxml")
   ...: officials = bsObj.find(id="officials")
   ...: print(officials)
   ...: 

<table class="suppress_all sortable stats_table" data-cols-to-freeze="0" id="officials"><caption>Officials Table</caption><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Referee</th><td class=" " data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Umpire</th><td class=" " data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Head Linesman</th><td class=" " data-stat="name"><a href="/officials/BergJe1r.htm">Jerry Bergman</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Field Judge</th><td class=" " data-stat="name"><a href="/officials/GautGr0r.htm">Greg Gautreaux</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Back Judge</th><td class=" " data-stat="name"><a href="/officials/YettGr0r.htm">Greg Yette</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Side Judge</th><td class=" " data-stat="name"><a href="/officials/PattRi0r.htm">Rick Patterson</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Line Judge</th><td class=" " data-stat="name"><a href="/officials/BaynRu0r.htm">Rusty Baynes</a></td></tr>
</table>

In [2]:

Or Duan · Answer

You don't see it because it is not there. Try turning JS off and opening the page in your browser: you will see the table is missing, because the website does some JS DOM manipulation.

Your choices are:

1. In your case, the HTML you want is there, just inside a comment; extract it from the comment with BeautifulSoup.
2. Use Selenium or an equivalent tool to render the JS (that is exactly how your browser does it).
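
The first choice can be sketched without a regex by asking BeautifulSoup for its Comment nodes and parsing each one separately. This is only a minimal illustration, not code from any of the answers: SAMPLE_HTML is a made-up stand-in for the downloaded page, find_in_comments is a hypothetical helper name, and html.parser is used so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for the real page: the table exists only inside a comment.
SAMPLE_HTML = """
<html><body>
<div id="all_officials">
<!--
<table id="officials">
<tr><th>Referee</th><td>Walt Coleman</td></tr>
<tr><th>Umpire</th><td>Roy Ellison</td></tr>
</table>
-->
</div>
</body></html>
"""


def find_in_comments(html, element_id, parser="html.parser"):
    """Return the first tag with the given id found inside an HTML comment, or None."""
    soup = BeautifulSoup(html, parser)
    # Comments are string nodes in the tree, so filter string nodes by type.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Parse the comment body as its own small document.
        inner = BeautifulSoup(str(comment), parser)
        match = inner.find(id=element_id)
        if match is not None:
            return match
    return None


# A direct lookup misses the table because it is comment text, not a tag.
direct = BeautifulSoup(SAMPLE_HTML, "html.parser").find(id="officials")

# Looking inside the comments finds it.
table = find_in_comments(SAMPLE_HTML, "officials")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(direct)
print(rows)
```

Unlike stripping every "<!--" and "-->" from the page, this leaves genuine comments elsewhere in the document untouched and only parses the ones you search.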
