I am scraping player names from the NBA website. The player index page is a single-page application, and the players are spread across several alphabetically ordered pages. I am unable to extract the names of all the players. Here is the link: https://in.global.nba.com/playerindex/
from selenium import webdriver
from bs4 import BeautifulSoup

class make():
    def __init__(self):
        self.first = ""
        self.last = ""

driver = webdriver.PhantomJS(executable_path=r'E:\Downloads\Compressed\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get('https://in.global.nba.com/playerindex/')
html_doc = driver.page_source
soup = BeautifulSoup(html_doc, 'lxml')

names = []
layer = soup.find_all("a", class_="player-name ng-isolate-scope")
for a in layer:
    span = a.find("span", class_="ng-binding")
    thing = make()
    thing.first = span.text
    spans = a.find("span", class_="ng-binding").find_next_sibling()
    thing.last = spans.text
    names.append(thing)
When dealing with SPAs, you shouldn't try to extract info from the DOM, because the DOM is incomplete until a JS-capable browser runs the scripts that populate it with data. Open up the page source, and you'll see the page HTML doesn't contain the data you need.
But most SPAs load their data using XHR requests. You can monitor network requests in Developer Console (F12) to see the requests being made during page load.
Here, https://in.global.nba.com/playerindex/ loads the player list from https://in.global.nba.com/stats2/league/playerlist.json?locale=en.
Simulate that request yourself, then pick whatever you need from the response. Inspect the request headers to figure out what you need to send along with the request.
import requests

if __name__ == '__main__':
    page_url = 'https://in.global.nba.com/playerindex/'

    s = requests.Session()
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}

    # visit the homepage to populate session with necessary cookies
    res = s.get(page_url)
    res.raise_for_status()

    json_url = 'https://in.global.nba.com/stats2/league/playerlist.json?locale=en'
    res = s.get(json_url)
    res.raise_for_status()

    data = res.json()
    player_names = [p['playerProfile']['displayName'] for p in data['payload']['players']]
    print(player_names)
output:
['Steven Adams', 'Bam Adebayo', 'Deng Adel', 'LaMarcus Aldridge', 'Kyle Alexander', 'Nickeil Alexander-Walker', ...
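Since the original make class stored separate first and last names, you could derive those from the displayName field shown above rather than guessing at other keys in the payload. A minimal sketch, assuming a split on the first space (which will be imperfect for some multi-part names):

import requests

s = requests.Session()
s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}
s.get('https://in.global.nba.com/playerindex/')  # populate cookies, as above
data = s.get('https://in.global.nba.com/stats2/league/playerlist.json?locale=en').json()

names = []
for p in data['payload']['players']:
    display_name = p['playerProfile']['displayName']   # field confirmed by the output above
    first, _, last = display_name.partition(' ')        # assumption: first space separates first and last name
    names.append({'first': first, 'last': last})
print(names[:5])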
One thing to watch out for is that some websites require an authorization token to be sent with requests. You can see it in the API requests if it's present.
If you're building a scraper that needs to be functional in the long(er) term, you might want to make the script more robust by extracting the token from the page and including it in requests.
This token (usually a JWT, which starts with ey...) often sits somewhere in the HTML, encoded as JSON. Or it is sent to the client as a cookie that the browser attaches to subsequent requests, or in a response header. In short, it could be anywhere. Scan the requests and responses to figure out where the token is coming from and how you can retrieve it yourself.
...
<script>
const state = {"token": "ey......", ...};
</script>
import json
import re
import requests

res = requests.get('url/to/page')

# extract the token from the page. Here `state` is an object serialized as JSON;
# we take everything after the `=` sign up to the semicolon and deserialize it
state = json.loads(re.search(r'const state = (.*);', res.text).group(1))
token = state['token']

res = requests.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})
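If the token arrives as a cookie instead (the other case mentioned above), a requests.Session will capture it after the first page load. A minimal sketch, where the cookie name 'token' and the URLs are placeholders rather than anything this particular site is known to use:

import requests

s = requests.Session()
s.get('url/to/page')                # the server sets the auth cookie on this response

# 'token' is a hypothetical cookie name -- check DevTools to find the real one.
# Often the session re-sends the cookie automatically, so the explicit header
# is only needed when the API expects a Bearer header instead.
token = s.cookies.get('token')
res = s.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})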