 

Script grabs only a few records out of many

I'm trying to get different college names and their rankings from a webpage. The script I've tried can parse the first few names and their rankings.

However, there are 233 names and rankings on that page, and they only become visible when the page is scrolled downward. Because the URL stays the same as the page scrolls, I can't come up with any logic to handle the pagination.

Website address

I do not wish to go for Selenium, which is why I created this post: I'd like to solve it using requests alone.

Here is what I've written so far (it grabs the first few records):

import requests
from bs4 import BeautifulSoup

url = 'https://www.usnews.com/best-colleges/rankings/national-liberal-arts-colleges'

# Only the first batch of schools is present in the initial HTML;
# the rest load dynamically as the page scrolls
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, "lxml")
for item in soup.select("[id^='school-']"):
    name = item.select_one("[class^='DetailCardColleges__StyledAnchor']").text
    rank = item.select_one("[class^='ranklist-ranked-item'] > strong").text
    print(name, rank)

How can I parse all the names and their rankings using requests?

asked Jul 24 '19 by robots.txt

1 Answer

The good thing for you is that this page uses a JSON API for pagination, so you don't even need bs4; you can do it with requests alone:

import requests

# The same JSON endpoint the page calls while scrolling; {page} selects the batch
url_template = 'https://www.usnews.com/best-colleges/api/search?_sort=rank&_sortDirection=asc&_page={page}&schoolType=national-liberal-arts-colleges'

headers = {
    'pragma': 'no-cache',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    'accept': '*/*',
    'cache-control': 'no-cache',
    'authority': 'www.usnews.com',
    'referer': 'https://www.usnews.com/'
}


def scrape_data(data):
    # Placeholder: dump the raw JSON; pull out the fields you need here
    print(data)


# Fetch page 1 first to learn the total number of pages
data = requests.get(url_template.format(page=1), headers=headers).json()
scrape_data(data)
total_pages = data["data"]["totalPages"]

# Walk the remaining pages
for i in range(2, total_pages + 1):
    data = requests.get(url_template.format(page=i), headers=headers).json()
    scrape_data(data)

In scrape_data I have just printed the whole response, but you can change it to extract whatever fields you need from that JSON.
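As a rough sketch of that extraction step: assuming each page's JSON holds the schools under `data` → `items`, with fields like `name` and `rank` (these key names are assumptions, not confirmed from the live API — print one raw response and adjust them to the real schema), scrape_data could look something like this:

```python
def scrape_data(data):
    """Pull (name, rank) pairs out of one page of the JSON response.

    NOTE: the key names "data", "items", "name", and "rank" are guesses
    about the response schema; inspect an actual response to confirm.
    """
    results = []
    for item in data.get("data", {}).get("items", []):
        name = item.get("name")
        rank = item.get("rank")
        results.append((name, rank))
        print(name, rank)
    return results
```

Returning the pairs (instead of only printing) lets you accumulate all 233 records across the paginated requests into one list.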

answered Sep 22 '22 by Tarun Lalwani