I want to extract the element with the class "bilibili-player-video-info-people-number" from this page: https://www.bilibili.com/video/BV1a44y167wK. When I create my BeautifulSoup object and search it, this class is not there. Is it due to the parser? I tried lxml and html5lib, but neither did any better.
<span class="bilibili-player-video-info-people-number">585</span>
That's the full element that I want to extract - the number updates every minute to show how many people are viewing currently.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import re
import html5lib
driver = webdriver.Chrome(r'C:\Users\Rob\Downloads\chromedriver.exe')
driver.get('https://www.bilibili.com/video/BV1a44y167wK')
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, 'html5lib')
viewers = soup.findAll('span', class_='bilibili-player-video-info-people-text')
print(viewers[0])
print(viewers[0]) raises an "index out of range" error because the viewers list is empty.
Thank you!
Almost the entire page is rendered by JavaScript, so bs4 is only useful if the element you want is already in the HTML returned by the request. In your case, it's not.
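One way to check this for yourself is to look for the class name in the HTML you actually received, before any JavaScript runs. Below is a minimal sketch using only the standard library; `ClassCollector` and `has_class` are hypothetical helper names, and the sample strings are made up for illustration rather than taken from the real page.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def has_class(html, class_name):
    """Return True if any tag in `html` carries `class_name`."""
    collector = ClassCollector()
    collector.feed(html)
    return class_name in collector.classes


# Made-up samples: one with the span present, one without it.
rendered = '<span class="bilibili-player-video-info-people-number">585</span>'
raw = '<div id="app"></div><script src="player.js"></script>'
print(has_class(rendered, "bilibili-player-video-info-people-number"))
print(has_class(raw, "bilibili-player-video-info-people-number"))
```

If the check fails on the text returned by a plain HTTP request, no choice of parser will help, because the element is created later in the browser.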
However, there's an API endpoint that you can query that carries this data (and much more).
With a bit of regex and requests you can get the online count (of viewers).
Here's how:
import re
import requests
with requests.Session() as connection:
    page_url = "https://www.bilibili.com/video/BV1a44y167wK"
    page = connection.get(page_url).text
    cid = re.search(r"cid\":(\d+),\"page", page).group(1)
    aid = re.search(r"aid\":(\d+),", page).group(1)
    url = f"https://api.bilibili.com/x/player/v2?cid={cid}&aid={aid}&bvid={page_url.rsplit('/', 1)[-1]}"
    print(connection.get(url).json()["data"]["online_count"])
Output (note: it might change, as viewers come and go):
562
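If the page layout changes, `re.search` returns `None` and the `.group(1)` call fails with an unhelpful `AttributeError`. The sketch below separates the ID extraction and URL building into small functions that can be checked without a network request. It reuses the same regex patterns and endpoint as above; `extract_ids` and `build_player_url` are hypothetical helper names, and the sample HTML is made up for illustration.

```python
import re
from urllib.parse import urlencode

API_URL = "https://api.bilibili.com/x/player/v2"


def extract_ids(page_html):
    """Return (cid, aid) as strings, or raise ValueError if either is missing."""
    cid_match = re.search(r'cid":(\d+),"page', page_html)
    aid_match = re.search(r'aid":(\d+),', page_html)
    if cid_match is None or aid_match is None:
        raise ValueError("cid/aid not found; the page layout may have changed")
    return cid_match.group(1), aid_match.group(1)


def build_player_url(page_url, page_html):
    """Build the player API URL from the video page URL and its HTML."""
    cid, aid = extract_ids(page_html)
    # The BV id is the last path segment of the video URL.
    bvid = page_url.rstrip("/").rsplit("/", 1)[-1]
    return f"{API_URL}?{urlencode({'cid': cid, 'aid': aid, 'bvid': bvid})}"


# Made-up fragment that mimics the shape the regexes expect.
sample_html = '{"aid":111,"bvid":"BV1a44y167wK","cid":222,"page":1}'
print(build_player_url("https://www.bilibili.com/video/BV1a44y167wK", sample_html))
```

The resulting URL can then be passed to `connection.get(...)` as in the code above, and the count read from `["data"]["online_count"]`.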