I want to get the content from this website.
If I use a browser like Firefox or Chrome I could get the real website page I want, but if I use the Python requests package (or wget
command) to get it, it returns a totally different HTML page.
I thought the developer of the website had made some blocks for this.
How do I fake a browser visit by using python requests or command wget?
The process is called user-agent spoofing. Yes, when a browser or any client sends a different user-agent HTTP header from what they are and fake it that is called spoofing. While the term may be alarming, this is not a dangerous activity and will not cause any problems for you.
user_agents is a Python library that provides an easy way to identify/detect devices like mobile phones, tablets and their capabilities by parsing (browser/HTTP) user agent strings. The goal is to reliably detect whether: User agent is a mobile, tablet or PC based device.
A user agent is a computer program representing a person, for example, a browser in a Web context. Besides a browser, a user agent could be a bot scraping webpages, a download manager, or another app accessing the Web.
A user agent is any software that retrieves and presents Web content for end users or is implemented using Web technologies. User agents include Web browsers, media players, and plug-ins that help in retrieving, rendering and interacting with Web content.
Provide a User-Agent
header:
import requests
url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)
FYI, here is a list of User-Agent strings for different browsers:
As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:
fake-useragent
Up to date simple useragent faker with real world database
Demo:
>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'
I used fake UserAgent.
How to use:
from fake_useragent import UserAgent
import requests
ua = UserAgent()
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)
Output:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With