I'm new to Python and HTML. I am trying to retrieve the number of comments from a page using requests and BeautifulSoup.
In this example I am trying to get the number 226. Here is the code as I can see it when I inspect the page in Chrome:
<a title="Go to the comments page" class="article__comments-counts" href="http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/comments/">
<span class="civil-comment-count" data-site-id="globeandmail" data-id="33519766" data-language="en">
226
</span>
Comments
</a>
When I request the text from the URL, I can find the code but there is no content between the span tags, no 226. Here is my code:
import requests, bs4
url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'html.parser')
span = soup.find('span', class_='civil-comment-count')
It returns this, same as the above but no 226.
<span class="civil-comment-count" data-id="33519766" data-language="en" data-site-id="globeandmail">
</span>
I'm at a loss as to why the value isn't appearing. Thank you in advance for any assistance.
The comment count is loaded and rendered by JavaScript, so it is not in the HTML that requests receives. You don't have to use Selenium, though. Instead, request the count from the API behind the page:
import requests
with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"}
    # visit main page
    base_url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
    session.get(base_url)
    # get the comments count
    url = "https://api-civilcomments.global.ssl.fastly.net/api/v1/topics/multiple_comments_count.json"
    params = {"publication_slug": "globeandmail",
              "reference_language": "en",
              "reference_ids": "33519766"}
    r = session.get(url, params=params)
    print(r.json())
Prints:
{'comment_counts': {'33519766': 226}}
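If you want to avoid hard-coding the parameters, here is a minimal sketch of building them from the span's data attributes and pulling the number out of the JSON. The helper names build_params and extract_count are hypothetical, and the mapping from data-site-id, data-language and data-id to the API parameters is an assumption based only on the values matching in this example. It runs offline on the sample data from this page:

```python
def build_params(span_attrs):
    """Map the span's data-* attributes onto the API's query parameters."""
    # Assumption: this attribute-to-parameter mapping is inferred from the
    # matching values in the question and answer, not from any API docs.
    return {
        "publication_slug": span_attrs["data-site-id"],
        "reference_language": span_attrs["data-language"],
        "reference_ids": span_attrs["data-id"],
    }


def extract_count(payload, article_id):
    """Return the comment count for article_id, or None if it is absent."""
    return payload.get("comment_counts", {}).get(str(article_id))


# Attributes as they appear on the <span> in the question
# (with BeautifulSoup these would come from span.attrs).
span_attrs = {
    "class": ["civil-comment-count"],
    "data-site-id": "globeandmail",
    "data-id": "33519766",
    "data-language": "en",
}
params = build_params(span_attrs)

# Response body as printed in the answer above (r.json()).
payload = {"comment_counts": {"33519766": 226}}
count = extract_count(payload, span_attrs["data-id"])

print(params)
print(count)
```

In a live run you would pass params to session.get(url, params=params) and feed r.json() to extract_count instead of the sample payload.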
This page uses JavaScript to fetch the comment count, so the number is missing when JavaScript is disabled.
You can find the real URL that returns the number in Chrome's Developer Tools (Network tab).
Then you can mimic that request using @alecxe's code.