 

BeautifulSoup only returning 100 elements


I am brand new to web scraping and want to scrape player names and salaries from Spotrac for a university project. What I have done so far is below.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.spotrac.com/nfl/rankings/'

reqs = requests.get(URL)
soup = BeautifulSoup(reqs.text, 'lxml')

print("List of all the team names:")
for my_tag in soup.find_all(class_="team-name"):
    print(my_tag.text)

print("List of all the info elements:")
for my_tag in soup.find_all(class_="info"):
    print(my_tag.text)

The output of this is only 100 names, but the page has 1000 elements. Is there a reason why this is the case?

asked Aug 12 '20 by user6074035

2 Answers

To get all names and other info, make an Ajax POST request to https://www.spotrac.com/nfl/rankings/:

import requests
from bs4 import BeautifulSoup


url = 'https://www.spotrac.com/nfl/rankings/'
data = {
    'ajax': 'true',
    'mobile': 'false'
}

soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
for h3 in soup.select('h3'):
    print(h3.text)
    print(h3.find_next(class_="rank-value").text)
    print('-' * 80)

Prints:

Dak Prescott
$31,409,000  
--------------------------------------------------------------------------------
Russell Wilson
$31,000,000  
--------------------------------------------------------------------------------


...all the way to


--------------------------------------------------------------------------------
Willie Gay Jr.
$958,372  
--------------------------------------------------------------------------------
Jace Sternberger
$956,632  
--------------------------------------------------------------------------------
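The `h3` / `find_next` pairing from the answer can be tested offline on a small HTML snippet shaped like the Ajax response (the class names match the answer; the snippet itself is a made-up stand-in, and collecting tuples instead of printing makes the data easy to write to CSV for the project):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the HTML the Ajax endpoint returns:
# each player is an <h3> followed by an element with class "rank-value".
html = """
<tr><td><h3>Dak Prescott</h3></td><td class="rank-value">$31,409,000</td></tr>
<tr><td><h3>Russell Wilson</h3></td><td class="rank-value">$31,000,000</td></tr>
"""

soup = BeautifulSoup(html, 'html.parser')

# Collect (name, salary) pairs instead of printing them, so the
# result can be written to CSV or loaded into pandas afterwards.
players = [(h3.text, h3.find_next(class_='rank-value').text.strip())
           for h3 in soup.select('h3')]
print(players)
```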
answered Sep 30 '22 by Andrej Kesely


As an addition to Andrej's answer:

This technique is called re-engineering HTTP requests. It's a more efficient way of scraping dynamic content, i.e. content loaded by JavaScript.

The alternative would be to use the selenium package to mimic browser activity: it automates a real browser session to make the HTTP requests for you, which is slower and more brittle in the long term.

You can find the requests the JavaScript invokes to load content onto the page. If you inspect the page in Chrome, go to the Network tab --> XHR.

Here you will find all the requests the JavaScript invokes. In this case there are two: one to Twitter and the one we want. To get the response we're interested in, sometimes a plain HTTP GET/POST request is enough, but sometimes we need to add headers, data, parameters, or cookies.


The Preview tab shows you a snapshot of the response; in this case we get back HTML, which we can then parse using BeautifulSoup.

As Andrej points out, you can find this information on the right-hand side of the screen: click the Headers tab and scroll down. There you will find the request headers, query parameters, and form data. In this case, the form data is where

data = {
    'ajax': 'true',
    'mobile': 'false'
}

comes from.

You can play about with this: if that doesn't get you the response you want, adding a User-Agent, headers, or other parameters that are part of the request can help. I tend to copy the request; if you right-click it you can copy it as cURL (bash) and paste that into curl.trillworks.com.
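One way to experiment with headers and form data without repeatedly hitting the site is to build the request with `requests.Request` and inspect it before sending; the User-Agent string below is just an illustrative value, not one the site is known to require:

```python
import requests

url = 'https://www.spotrac.com/nfl/rankings/'
data = {'ajax': 'true', 'mobile': 'false'}
# Example header copied from a browser; any realistic User-Agent works here.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Prepare the POST without sending it, so the encoded body and the
# headers can be checked before firing the real request.
req = requests.Request('POST', url, data=data, headers=headers)
prepared = req.prepare()

print(prepared.body)                      # ajax=true&mobile=false
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
```

Once the prepared request looks right, sending it with `requests.Session().send(prepared)` (or simply `requests.post(url, data=data, headers=headers)`) should produce the same response the browser gets.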


This presents the request with the headers/data/parameters required to mimic it. Sometimes it's a bit overkill and includes more data than you actually need to send to get the correct response, but you can pare it down to the most minimal request that still gets the response you want.

answered Sep 30 '22 by AaronS