Retrieve a number from a span tag, using Python requests and Beautiful Soup

I'm new to Python and HTML. I am trying to retrieve the number of comments from a page using requests and BeautifulSoup.

In this example I am trying to get the number 226. Here is the code as I can see it when I inspect the page in Chrome:

<a title="Go to the comments page" class="article__comments-counts" href="http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/comments/">
    <span class="civil-comment-count" data-site-id="globeandmail" data-id="33519766" data-language="en">
    226
    </span>
    Comments
</a>

When I request the text from the URL, I can find the code but there is no content between the span tags, no 226. Here is my code:

import requests, bs4

url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'html.parser')

span = soup.find('span', class_='civil-comment-count')

It returns the following: the same span as above, but without the 226.

<span class="civil-comment-count" data-id="33519766" data-language="en" data-site-id="globeandmail">
</span>

I'm at a loss as to why the value isn't appearing. Thank you in advance for any assistance.

morgonhorn asked Jan 08 '17 02:01


2 Answers

The comment count on this page is loaded and rendered by JavaScript, which is why it is missing from the HTML that requests receives. You don't need Selenium for this, though; you can request the count directly from the API behind the widget:

import requests

with requests.Session() as session:
    # send a browser-like User-Agent; update() keeps the session's other default headers
    session.headers.update({"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"})

    # visit main page
    base_url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
    session.get(base_url)

    # get the comments count
    url = "https://api-civilcomments.global.ssl.fastly.net/api/v1/topics/multiple_comments_count.json"
    params = {"publication_slug": "globeandmail",
              "reference_language": "en",
              "reference_ids": "33519766"}
    r = session.get(url, params=params)
    print(r.json())

Prints:

{'comment_counts': {'33519766': 226}}
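
If only the number is needed, it can be pulled out of the decoded JSON. A minimal sketch, assuming the response keeps the shape printed above; `extract_comment_count` is a hypothetical helper name, not part of requests or the API:

```python
def extract_comment_count(payload, article_id):
    """Return the comment count for article_id, or None if it is absent.

    Assumes payload has the shape shown in the printed output:
    {'comment_counts': {'<article id>': <count>}}.
    """
    counts = payload.get("comment_counts", {})
    # The keys in the sample output are strings, so normalise the ID first
    count = counts.get(str(article_id))
    return int(count) if count is not None else None


# Sample payload copied from the output printed above
sample = {'comment_counts': {'33519766': 226}}
print(extract_comment_count(sample, "33519766"))
```

In the session code above, it would be called as extract_comment_count(r.json(), "33519766").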
alecxe answered Oct 10 '22 09:10


This page uses JavaScript to fetch the comment count; with JavaScript disabled in the browser, the count does not appear.

You can find the actual URL that returns the number in the Network tab of Chrome's Developer Tools.

Then you can mimic that request using @alecxe's code.
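
For reference, the request to look for in Developer Tools should resemble the URL built below. This is only a sketch using the standard library: the endpoint and parameter names are taken from @alecxe's answer, `build_count_url` is a hypothetical helper, and the mapping from the span's data-site-id and data-id attributes to publication_slug and reference_ids is an assumption based on the matching values in the question.

```python
from urllib.parse import urlencode

# Endpoint copied from @alecxe's answer
API_URL = "https://api-civilcomments.global.ssl.fastly.net/api/v1/topics/multiple_comments_count.json"


def build_count_url(site_id, article_id, language="en"):
    """Build the comment-count URL for one article.

    site_id and article_id are assumed to be the span's data-site-id
    and data-id attribute values.
    """
    query = urlencode({
        "publication_slug": site_id,
        "reference_language": language,
        "reference_ids": article_id,
    })
    return API_URL + "?" + query


print(build_count_url("globeandmail", "33519766"))
```

Comparing this URL with the one listed in the Network tab is a quick way to check that the parameters are right before sending the request from Python.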

宏杰李 answered Oct 10 '22 08:10