I'm trying to get the names of different colleges and their rankings from a webpage. The script I've tried can parse the first few names and their rankings correctly.
However, there are 233 names and rankings on that page, and they only become visible when the page is scrolled downward. The problem is that when the page is scrolled, the URL stays the same, so I can't build any logic to handle the pagination.
I do not wish to use Selenium, which is why I created this post: I'd like to solve it using requests.
This is what I've written so far (it grabs the first few records):
import requests
from bs4 import BeautifulSoup

url = 'https://www.usnews.com/best-colleges/rankings/national-liberal-arts-colleges'

r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, "lxml")
for item in soup.select("[id^='school-']"):
    name = item.select_one("[class^='DetailCardColleges__StyledAnchor']").text
    rank = item.select_one("[class^='ranklist-ranked-item'] > strong").text
    print(name, rank)
How can I parse all the names and their rankings using requests?
The good news for you is that this page uses a JSON API for pagination, so you don't even need bs4; you can do it with requests alone:
import requests

url_template = 'https://www.usnews.com/best-colleges/api/search?_sort=rank&_sortDirection=asc&_page={page}&schoolType=national-liberal-arts-colleges'

headers = {
    'pragma': 'no-cache',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    'accept': '*/*',
    'cache-control': 'no-cache',
    'authority': 'www.usnews.com',
    'referer': 'https://www.usnews.com/'
}

def scrape_data(data):
    print(data)

# Fetch the first page to learn the total page count
data = requests.get(url_template.format(page=1), headers=headers).json()
scrape_data(data)

# Then fetch the remaining pages one by one
total_pages = data["data"]["totalPages"]
for i in range(2, total_pages + 1):
    data = requests.get(url_template.format(page=i), headers=headers).json()
    scrape_data(data)
In scrape_data I've just printed the whole response, but you can change it to extract whatever fields you need from that JSON.
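As a sketch of that extraction: assuming each page of the response keeps its records under data["data"]["items"] and each item carries the school name and rank (those key names are a guess, not confirmed; inspect the real JSON in your browser's network tab), scrape_data could look like:

```python
def scrape_data(data):
    """Pull (name, rank) pairs out of one page of the JSON response.

    The "items" / "institution" / "ranking" keys below are assumptions
    about the payload shape -- verify them against the actual response.
    """
    rows = []
    for item in data.get("data", {}).get("items", []):
        name = item.get("institution", {}).get("displayName")
        rank = item.get("ranking", {}).get("displayRank")
        print(name, rank)
        rows.append((name, rank))
    return rows
```

Using .get() with defaults keeps the function from raising KeyError on pages where a field is missing; it just yields None for that entry instead.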