Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python requests.get() returns broken source code instead of expected source code?

Made a request on the above Wikipedia page. Specifically I need to scrape "results matrix" from https://en.wikipedia.org/wiki/2017%E2%80%9318_La_Liga#Results

selectedSeasonPage = requests.get('https://en.wikipedia.org/wiki/2017–18_La_Liga', features='html5lib')

Doing pprint.pprint(selectedSeasonPage.text) and jumping to source code of matrix, it can be seen it's incomplete.

Snippet of HTML returned by requests.get() :

<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">— </td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>

HTML returned by requests.get() viewed through browser and as expected its not complete. Can check this image for reference.

Snippet from view-source and the output needed.

<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">&#8212;</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">3–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–2</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">1–2</td>

Posting a sample HTML for reference since posting entire output is not possible. Can post more specific parts if required.

My question is how to get entire source of matrix without resulting in loss of values?

From what I understand going through previous questions, requests fails to return expected output if some part of page is rendered by JavaScript. But this page seems to be simple HTML and CSS (at least the part that is required). Cannot use Selenium need to scrape multiple pages. Would be grateful for solution using requests or something equivalent.

Requests version is 2.19.1. Python version is 3.7.0.

Is anything missing? I am new to this stuff, any help appreciated.

like image 883
Fatal Python Error Avatar asked Nov 08 '22 02:11

Fatal Python Error


1 Answers

Almost your exact code without the "features" parameter in the get call:

import requests
selectedSeasonPage = requests.get('https://en.wikipedia.org/wiki/2017–18_La_Liga')
print(selectedSeasonPage.text)

Gives me:

<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a>
</th>
<td style="font-weight:normal;background:transparent;">&#8212;</td>
<td style="white-space:nowrap;font-weight:normal;background:#BBF3FF;">3–1</td>
<td style="white-space:nowrap;font-weight:normal;background:#FBB;">0–1</td>
<td style="white-space:nowrap;font-weight:normal;background:#FBB;">0–2</td>
<td style="white-space:nowrap;font-weight:normal;background:#BBF3FF;">2–1</td>
<td style="white-space:nowrap;font-weight:normal;background:#BBF3FF;">1–0</td>
<td style="white-space:nowrap;font-weight:normal;background:#FBB;">1–2</td>
like image 103
Noah B. Johnson Avatar answered Nov 14 '22 23:11

Noah B. Johnson