I am scraping player names from the NBA website. The player index page is a single-page application, and the players are spread across several alphabetically ordered pages. I am unable to extract the names of all the players. Here is the link: https://in.global.nba.com/playerindex/
from selenium import webdriver
from bs4 import BeautifulSoup

class make():
    def __init__(self):
        self.first = ""
        self.last = ""

driver = webdriver.PhantomJS(executable_path=r'E:\Downloads\Compressed\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get('https://in.global.nba.com/playerindex/')
html_doc = driver.page_source
soup = BeautifulSoup(html_doc, 'lxml')

names = []
layer = soup.find_all("a", class_="player-name ng-isolate-scope")
for a in layer:
    span = a.find("span", class_="ng-binding")
    thing = make()
    thing.first = span.text
    spans = a.find("span", class_="ng-binding").find_next_sibling()
    thing.last = spans.text
    names.append(thing)
When dealing with SPAs, you shouldn't try to extract info from the DOM, because the DOM is incomplete until a JS-capable browser runs the scripts that populate it with data. Open up the page source, and you'll see the page HTML doesn't contain the data you need.
But most SPAs load their data using XHR requests. You can monitor network requests in Developer Console (F12) to see the requests being made during page load.
Here, https://in.global.nba.com/playerindex/ loads the player list from https://in.global.nba.com/stats2/league/playerlist.json?locale=en.
Simulate that request yourself, then pick whatever you need from the response. Inspect the request headers to figure out what you need to send along with the request.
import requests

if __name__ == '__main__':
    page_url = 'https://in.global.nba.com/playerindex/'

    s = requests.Session()
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}

    # visit the homepage to populate session with necessary cookies
    res = s.get(page_url)
    res.raise_for_status()

    json_url = 'https://in.global.nba.com/stats2/league/playerlist.json?locale=en'
    res = s.get(json_url)
    res.raise_for_status()

    data = res.json()
    player_names = [p['playerProfile']['displayName'] for p in data['payload']['players']]
    print(player_names)
output:
['Steven Adams', 'Bam Adebayo', 'Deng Adel', 'LaMarcus Aldridge', 'Kyle Alexander', 'Nickeil Alexander-Walker', ...
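Since the original make class stored separate first and last names, you could derive those from the displayName field shown above rather than guessing at other keys in the payload. A minimal sketch, assuming a split on the first space (which will be imperfect for some multi-part names):

import requests

s = requests.Session()
s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}
s.get('https://in.global.nba.com/playerindex/')  # populate cookies, as above
data = s.get('https://in.global.nba.com/stats2/league/playerlist.json?locale=en').json()

names = []
for p in data['payload']['players']:
    display_name = p['playerProfile']['displayName']   # field confirmed by the output above
    first, _, last = display_name.partition(' ')        # assumption: first space separates first and last name
    names.append({'first': first, 'last': last})
print(names[:5])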
One thing to watch out for is that some websites require an authorization token to be sent with requests. You can see it in the API requests if it's present.
If you're building a scraper that needs to be functional in the long(er) term, you might want to make the script more robust by extracting the token from the page and including it in requests.
This token (usually a JWT, which starts with ey...) often sits somewhere in the HTML, encoded as JSON. Or it is sent to the client as a cookie that the browser attaches to subsequent requests, or in a response header. In short, it could be anywhere. Scan the requests and responses to figure out where the token is coming from and how you can retrieve it yourself.
...
<script>
const state = {"token": "ey......", ...};
</script>
import json
import re
import requests

res = requests.get('url/to/page')

# extract the token from the page. Here `state` is an object serialized as JSON;
# we take everything after the `=` sign up to the semicolon and deserialize it
state = json.loads(re.search(r'const state = (.*);', res.text).group(1))
token = state['token']

res = requests.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})
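If the token arrives as a cookie instead (the other case mentioned above), a requests.Session will capture it after the first page load. A minimal sketch, where the cookie name 'token' and the URLs are placeholders rather than anything this particular site is known to use:

import requests

s = requests.Session()
s.get('url/to/page')                # the server sets the auth cookie on this response

# 'token' is a hypothetical cookie name -- check DevTools to find the real one.
# Often the session re-sends the cookie automatically, so the explicit header
# is only needed when the API expects a Bearer header instead.
token = s.cookies.get('token')
res = s.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})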