I am requesting a Wikipedia page with requests. Specifically, I need to scrape the "results matrix" from https://en.wikipedia.org/wiki/2017%E2%80%9318_La_Liga#Results
selectedSeasonPage = requests.get('https://en.wikipedia.org/wiki/2017–18_La_Liga', features='html5lib')
Printing the response with pprint.pprint(selectedSeasonPage.text) and jumping to the source of the matrix shows that it is incomplete.
Snippet of the HTML returned by requests.get():
<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">— </td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>
Viewing the HTML returned by requests.get() in a browser confirms that it is not complete. See the image for reference.
Snippet from view-source, which is the output I need:
<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">—</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">3–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–2</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">1–2</td>
I am posting only a sample of the HTML for reference, since posting the entire output is not possible. I can post more specific parts if required.
My question is: how do I get the entire source of the matrix without losing any values?
From what I understand after going through previous questions, requests fails to return the expected output if some part of the page is rendered by JavaScript. But this page seems to be plain HTML and CSS (at least the part that I need). I cannot use Selenium because I need to scrape multiple pages. I would be grateful for a solution using requests or something equivalent.
Requests version is 2.19.1. Python version is 3.7.0.
Is anything missing? I am new to this, so any help is appreciated.
Almost your exact code, but without the "features" parameter in the get call (requests.get() does not accept it; features is a BeautifulSoup argument):
import requests
selectedSeasonPage = requests.get('https://en.wikipedia.org/wiki/2017–18_La_Liga')
print(selectedSeasonPage.text)
Gives me:
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a>
</th>
<td style="font-weight:normal;background:transparent;">—</td>
<td style="white-space:nowrap;font-weight:normal;background:#BBF3FF;">3–1</td>
<td style="white-space:nowrap;font-weight:normal;background:#FBB;">0–1</td>
<td style="white-space:nowrap;font-weight:normal;background:#FBB;">0–2</td>
<td style="white-space:nowrap;font-weight:normal;background:#BBF3FF;">2–1</td>
<td style="white-space:nowrap;font-weight:normal;background:#BBF3FF;">1–0</td>
<td style="white-space:nowrap;font-weight:normal;background:#FBB;">1–2</td>
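Once the full HTML comes back, the score cells still have to be pulled out of the markup. The following is only a minimal sketch of one way to do that with the standard library's html.parser; the names RowParser and parse_rows are made up for this example, and the sample string is a shortened version of the row shown above, not the real page.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row.

    Hypothetical helper for illustration; not part of requests or Wikipedia.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()
            self._row = []
        elif tag in ("td", "th"):
            self._flush_cell()
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def _flush_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def parse_rows(html):
    """Return every table row found in html as a list of cell strings."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    parser._flush_row()  # pick up a trailing row with no closing </tr>
    return parser.rows


# Shortened sample based on the row above (assumed markup, not the full page).
sample = (
    '<table><tr><th scope="row"><a href="/wiki/Deportivo_Alav%C3%A9s">Alavés</a>\n'
    "</th><td>—</td><td>3–1</td><td>0–1</td></tr></table>"
)
print(parse_rows(sample))
```

With a real response, parse_rows(selectedSeasonPage.text) would return the rows of every table on the page, so the results matrix would still need to be picked out, for example by looking for the row that starts with the team name. An empty string in a row would point at a cell that came back without a score.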