
Python requests isn't giving me the same HTML as my browser is

I am grabbing a Wikia page using Python requests. There's a problem, though: the response from requests doesn't contain the same HTML my browser gets for the very same page.

For comparison, here's the page Firefox gets me, and here's the page requests fetches (download them to view - sorry, no easy way to just visually host a bit of HTML from another site).

You'll note a few differences (super unfriendly diff). There are some small things, like attributes being ordered differently and such, but there are also a few very, very large things. Most important is the lack of the last six <img>s, and the entirety of the navigation and footer sections. Even in the raw HTML it looks like the page cut off abruptly.
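A friendlier way to spot this kind of gap than a raw diff is to count tags in both copies. The following is only a rough sketch, assuming both pages have been saved locally; the helper names `count_tags` and `missing_tags` are made up for illustration and use nothing beyond the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each start tag appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img/>.
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Return a {tag_name: occurrences} dict for the given HTML text."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def missing_tags(browser_html, fetched_html):
    """Return {tag: shortfall} for tags that occur less often in fetched_html."""
    browser = count_tags(browser_html)
    fetched = count_tags(fetched_html)
    return {
        tag: count - fetched.get(tag, 0)
        for tag, count in browser.items()
        if count > fetched.get(tag, 0)
    }
```

Feeding it the text of the two saved files (say, hypothetical `firefox.html` and `requests.html`) should list `img` and any other elements that are short in the fetched copy, which makes a cut-off page easy to recognize.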

Why is this happening, and is there a way to fix it? I've thought of a bunch of things already, none of which have been fruitful:

  • Request headers interfering? Nope, I tried copying the headers my browser sends, User-Agent and all, 1:1 into the requests request, but nothing changed.
  • JavaScript loading content after the HTML is loaded? Nah. Even with JS disabled, Firefox gives me the "good" page.
  • Uh... well... what else could there be?

It'd be amazing if you know a way this could happen and a way to fix it. Thank you!

obskyr asked Apr 21 '15 13:04



1 Answer

I had a similar issue:

  • Identical headers with Python and through the browser
  • JavaScript definitely ruled out as a cause

To resolve the issue, I ended up swapping out the requests library for urllib.request.

Basically, I replaced:

    import requests

    session = requests.Session()
    r = session.get(URL)

with:

    import urllib.request

    r = urllib.request.urlopen(URL)

and then it worked.

Maybe one of those libraries is doing something strange behind the scenes? Not sure if that's an option for you or not.
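For anyone wanting to try the same swap while still sending custom headers, here is a minimal sketch using only the standard library. The function name `fetch_html`, the 10-second default timeout, and the UTF-8 fallback are my own assumptions, not something from the original code, and I can't promise it fixes every case of this problem.

```python
import urllib.request


def fetch_html(url, headers=None, timeout=10):
    """Fetch url with urllib.request and return the body decoded to text."""
    # headers is an optional dict, e.g. {"User-Agent": "..."} copied from a browser.
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
        # Use the charset from the Content-Type header; assume UTF-8 if absent.
        charset = response.headers.get_content_charset() or "utf-8"
    return body.decode(charset, errors="replace")
```

Calling `fetch_html(URL)` returns the page as a string, which can then be compared against what `requests` returned for the same URL.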

rnhuneau answered Sep 29 '22 12:09