I have used the requests library many times and I know it has a lot of advantages. However, I was trying to retrieve the following Wikipedia page:
https://en.wikipedia.org/wiki/Talk:Land_value_tax
and requests.get retrieves it partially:
response = requests.get('https://en.wikipedia.org/wiki/Talk:Land_value_tax', verify=False)
html = response.text
I tried the same thing with urllib2.urlopen, and it retrieves the same page completely:
html = urllib2.urlopen('https://en.wikipedia.org/wiki/Talk:Land_value_tax').read()
Does anyone know why this happens and how to solve it using requests?
By the way, judging by the number of times this post has been viewed, people are interested in the differences between these two libraries. If anyone knows about other differences between them, I would appreciate it if they edited this question or posted an answer describing those differences.
It seems to me the problem lies in the scripting on the target page. Some of the content is rendered by JavaScript (in particular, I found calls to MediaWiki). You can use a web sniffer to identify those calls.

What to do? If you want to retrieve the whole page content, you had better plug in one of the libraries that evaluate in-page JavaScript.
I am not interested in retrieving the whole page and statistics or JS libraries retrieved from MediaWiki. I only need the whole content of the page (through scraping, not MediaWiki API).
The issue is that those JS calls to other resources (including MediaWiki) are what make it possible to render the WHOLE page on the client. Since the library does not support JS execution, the JS is not executed => parts of the page are not loaded from the other resources => the target page is not whole.
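As a rough way to check this without a web sniffer, one could list the scripts referenced in the HTML that the library returned, since none of them get executed. The snippet below is only a sketch: it is written for Python 3 with the standard library's html.parser, the names ScriptCollector and find_scripts are made up for illustration, and the sample markup is invented, not taken from the Wikipedia page.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Records every <script> tag seen in a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.external = []  # src values of scripts loaded from other resources
        self.inline = 0     # number of <script> tags without a src attribute

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.external.append(src)
        else:
            self.inline += 1


def find_scripts(html):
    """Return (list of external script URLs, count of inline scripts)."""
    collector = ScriptCollector()
    collector.feed(html)
    return collector.external, collector.inline


# Invented sample markup, only to show the shape of the result.
sample = (
    '<html><head><script src="/example/startup.js"></script>'
    "<script>var x = 1;</script></head>"
    "<body><p>Talk</p></body></html>"
)
external, inline = find_scripts(sample)
print(external, inline)
```

For the page in the question, the same function could be called on response.text from requests.get. Any script it reports is something a browser would run but requests would not, so it is a candidate explanation for missing content, not proof of it.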