I am using requests and bs4 to scrape some data from a Chinese website that also has an English version. I wrote this to see if I get the right data:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://dotamax.com/hero/rate/')
soup = BeautifulSoup(page.content, "lxml")
for i in soup.find_all('span'):
    print(i.text)
And I do. The only problem is that the text is in Chinese, although it is in English when I look at the page source in my browser. Why do I get Chinese instead of English, and how can I fix that?
The website appears to check the GET request for an Accept-Language header. If the request doesn't have one, it serves the Chinese version. This is an easy fix, though: pass custom headers as described in the requests documentation:
import requests
from bs4 import BeautifulSoup
headers = {'Accept-Language': 'en-US,en;q=0.8'}
page = requests.get('http://dotamax.com/hero/rate/', headers=headers)
soup = BeautifulSoup(page.content, "lxml")
for i in soup.find_all('span'):
    print(i.text)
produces:
Anti-Mage
Axe
Bane
Bloodseeker
Crystal Maiden
Drow Ranger
...
etc.
Usually, when a page looks different in your browser than in the content requests returns, the cause is the type of request or the headers being sent. One really useful tip for web scraping that I wish I had realized much earlier: if you hit F12 and go to the "Network" tab in Chrome or Firefox, you can see the exact request your browser makes, including its headers, which is very useful for debugging.
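To compare against what the browser shows in the "Network" tab, it can help to look at the headers requests itself would send. The following is only a rough sketch: the helper name preview_headers is made up for illustration, and it builds the request without actually contacting the server, so it says nothing about how the site responds.

```python
import requests

URL = "http://dotamax.com/hero/rate/"


def preview_headers(extra_headers=None):
    """Return the headers requests would send for a GET, without any network call.

    preview_headers is a hypothetical helper, not part of requests.
    """
    session = requests.Session()
    request = requests.Request("GET", URL, headers=extra_headers)
    # prepare_request merges the session's default headers with our own.
    prepared = session.prepare_request(request)
    return dict(prepared.headers)


default_headers = preview_headers()
english_headers = preview_headers({"Accept-Language": "en-US,en;q=0.8"})

# By default, requests does not add an Accept-Language header.
print("Accept-Language" in default_headers)
# With the custom header, the value is passed through as given.
print(english_headers["Accept-Language"])
```

If a header that your browser sends is missing from this output, adding it to the headers dict is a reasonable first thing to try.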