I'm trying to get the names of different colleges and their rankings from a webpage. The script I've tried can parse the first few names and their rankings correctly.
However, there are 233 names and rankings on that page, and they only become visible when the page is scrolled downward. The problem is that when the page is scrolled, the URL stays the same, so I can't build any logic to handle the pagination.
I do not wish to use Selenium, which is why I created this post: I'd like to solve it using requests.
This is what I've written so far (it grabs the first few records):
import requests
from bs4 import BeautifulSoup

url = 'https://www.usnews.com/best-colleges/rankings/national-liberal-arts-colleges'

r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, "lxml")
for item in soup.select("[id^='school-']"):
    name = item.select_one("[class^='DetailCardColleges__StyledAnchor']").text
    rank = item.select_one("[class^='ranklist-ranked-item'] > strong").text
    print(name, rank)
How can I parse all the names and their rankings using requests?
The good news for you is that this page uses a JSON API for pagination, so you don't even need bs4; you can do it with requests alone:
import requests

url_template = 'https://www.usnews.com/best-colleges/api/search?_sort=rank&_sortDirection=asc&_page={page}&schoolType=national-liberal-arts-colleges'

headers = {
    'pragma': 'no-cache',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    'accept': '*/*',
    'cache-control': 'no-cache',
    'authority': 'www.usnews.com',
    'referer': 'https://www.usnews.com/'
}

def scrape_data(data):
    print(data)

# Fetch the first page to learn the total page count
data = requests.get(url_template.format(page=1), headers=headers).json()
scrape_data(data)

# Then fetch the remaining pages one by one
total_pages = data["data"]["totalPages"]
for i in range(2, total_pages + 1):
    data = requests.get(url_template.format(page=i), headers=headers).json()
    scrape_data(data)
In scrape_data I've just printed the whole response, but you can change it to extract whatever fields you need from that JSON.
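As a sketch of that extraction: assuming each page of the response keeps its records under data["data"]["items"] and each item carries the school name and rank (those key names are a guess, not confirmed; inspect the real JSON in your browser's network tab), scrape_data could look like:

```python
def scrape_data(data):
    """Pull (name, rank) pairs out of one page of the JSON response.

    The "items" / "institution" / "ranking" keys below are assumptions
    about the payload shape -- verify them against the actual response.
    """
    rows = []
    for item in data.get("data", {}).get("items", []):
        name = item.get("institution", {}).get("displayName")
        rank = item.get("ranking", {}).get("displayRank")
        print(name, rank)
        rows.append((name, rank))
    return rows
```

Using .get() with defaults keeps the function from raising KeyError on pages where a field is missing; it just yields None for that entry instead.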