How to scrape single-page application websites in Python using bs4

I am scraping player names from the NBA website. The player index page is built as a single-page application, and the players are distributed across several pages in alphabetical order. I am unable to extract the names of all the players. Here is the link: https://in.global.nba.com/playerindex/

from selenium import webdriver
from bs4 import BeautifulSoup

class make():
    """Container for a player's first and last name."""
    def __init__(self):
        self.first = ""
        self.last = ""

driver = webdriver.PhantomJS(executable_path=r'E:\Downloads\Compressed\phantomjs-2.1.1-windows\bin\phantomjs.exe')

driver.get('https://in.global.nba.com/playerindex/')

html_doc = driver.page_source

soup = BeautifulSoup(html_doc, 'lxml')

names = []

# each player link holds the first and last name in two sibling <span> elements
layer = soup.find_all("a", class_="player-name ng-isolate-scope")
for a in layer:
    span = a.find("span", class_="ng-binding")
    thing = make()
    thing.first = span.text
    spans = span.find_next_sibling()
    thing.last = spans.text
    names.append(thing)
Asked Jul 16 '19 by Saurabh Rawat



1 Answer

When dealing with SPAs, you shouldn't try to extract info from the DOM, because the DOM is incomplete until a JS-capable browser runs the page's scripts and populates it with data. Open up the page source, and you'll see the raw HTML doesn't contain the data you need.
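
You can verify this without a browser: fetch the raw HTML with requests and look for a name you expect in the rendered page (a quick sketch; 'Steven Adams' is taken from the API output shown further below):

import requests

# the raw HTML, before any JavaScript has run
html = requests.get('https://in.global.nba.com/playerindex/').text
print('Steven Adams' in html)  # prints False if the names are loaded client-side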

But most SPAs load their data using XHR requests. You can monitor network requests in the browser's developer tools (F12, Network tab) to see the requests made during page load.

Here, https://in.global.nba.com/playerindex/ loads its player list from https://in.global.nba.com/stats2/league/playerlist.json?locale=en

Simulate that request yourself, then pick whatever you need. Inspect the request headers to figure out what you need to send with the request.

import requests

if __name__ == '__main__':
    page_url = 'https://in.global.nba.com/playerindex/'
    s = requests.Session()
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}

    # visit the homepage to populate session with necessary cookies
    res = s.get(page_url)
    res.raise_for_status()

    json_url = 'https://in.global.nba.com/stats2/league/playerlist.json?locale=en'
    res = s.get(json_url)
    res.raise_for_status()
    data = res.json()

    player_names = [p['playerProfile']['displayName'] for p in data['payload']['players']]
    print(player_names)

output:

['Steven Adams', 'Bam Adebayo', 'Deng Adel', 'LaMarcus Aldridge', 'Kyle Alexander', 'Nickeil Alexander-Walker', ...
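
The question's make objects want first and last names separately. Continuing from the snippet above, you can split each display name; a minimal sketch (a naive split on the first space, which will mishandle multi-part surnames, so treat it as a starting point):

names = []
for full_name in player_names:
    first, _, last = full_name.partition(' ')  # naive: split on the first space
    names.append((first, last))

print(names[0])  # ('Steven', 'Adams')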

Dealing with auth

One thing to watch out for is that some websites require an authorization token to be sent with requests. You can see it in the API requests if it's present.

If you're building a scraper that needs to be functional in the long(er) term, you might want to make the script more robust by extracting the token from the page and including it in requests.

This token (often a JWT, starting with ey...) usually sits somewhere in the HTML, encoded as JSON, or it is sent to the client as a cookie that the browser attaches to subsequent requests, or it arrives in a response header. In short, it could be anywhere. Scan the requests and responses to figure out where the token comes from and how you can retrieve it yourself. For example:

...
<script>
const state = {"token": "ey......", ...};
</script>

import json
import re

import requests

res = requests.get('url/to/page')

# extract the token from the page. Here `state` is an object serialized as JSON;
# take everything after the `=` sign up to the semicolon and deserialize it
state = json.loads(re.search(r'const state = (.*);', res.text).group(1))
token = state['token']

res = requests.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})
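
If the token arrives as a cookie instead, a requests.Session picks it up on the first page visit, and you only need to copy it into whatever header the API expects. A minimal sketch (the cookie name access_token and the URLs are hypothetical):

import requests

s = requests.Session()
s.get('url/to/page')  # the response sets the auth cookie on the session

token = s.cookies.get('access_token')  # cookie name is an assumption
res = s.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})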
Answered Sep 22 '22 by abdusco