
Text from website appears as Gibberish instead of Hebrew

I'm trying to get a string from a website. I use the requests module to send the GET request.

import requests

text = requests.get("http://example.com")  # send a GET request to the website
print text.text  # print the decoded response body

However, for some reason, the text appears in Gibberish instead of Hebrew:

<div>
<p>×©×¨×ª</p>
</div>

Though when I sniff the traffic with Fiddler or view the website in my browser, I see it in Hebrew:

<div>
<p>שרת</p>
</div>

By the way, the HTML contains a meta tag that declares the encoding as utf-8. I tried encoding the text to utf-8, but it is still gibberish. I tried decoding it as utf-8, but that throws a UnicodeEncodeError exception. I declared that I'm using utf-8 in the first line of the script. Moreover, the problem also happens when I send the request with the built-in urllib module.

I read the Unicode HOWTO, but still couldn't manage to fix it. I also read many threads here (both about the UnicodeEncodeError exception and about why Hebrew turns into gibberish in Python), but none of them solved the problem.

I'm using Python 2.7.9 on a Windows machine. I'm running my script in the Python IDLE.

Thanks in advance.

ohad987 asked May 01 '15 14:05


1 Answer

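The server isn't declaring the encoding correctly, so the UTF-8 bytes are being decoded as Latin-1. Reversing that wrong decoding recovers the Hebrew:

>>> print u'×©×¨×ª'.encode('latin-1').decode('utf-8')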
שרת
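
To see where the gibberish comes from, here is a minimal offline sketch of the round trip. It assumes the server sent UTF-8 bytes and the client decoded them as Latin-1 (ISO-8859-1), which is what requests falls back to for text/* responses whose Content-Type header names no charset. The variable names are illustrative only, and the code avoids print so it runs the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: reproduce the mojibake and undo it.

original = u'\u05e9\u05e8\u05ea'   # the Hebrew word from the page
raw = original.encode('utf-8')     # the bytes the server actually sends

# Wrong charset guess: every UTF-8 byte becomes one Latin-1 character.
garbled = raw.decode('latin-1')    # u'×©×¨×ª'

# Undo the wrong guess: back to the same bytes, then decode as UTF-8.
repaired = garbled.encode('latin-1').decode('utf-8')
```

Each Hebrew letter takes two bytes in UTF-8, which is why three letters turn into six garbled characters.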

Set text.encoding before accessing text.text.

import requests

text = requests.get("http://example.com")  # send a GET request to the website
text.encoding = 'utf-8'  # correct the page encoding
print text.text  # print the decoded response body
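
If you would rather not hard-code the charset, one option is to decode the raw bytes yourself using the charset named in the page's meta tag, since the question says the HTML declares utf-8 there. The following is only a sketch under that assumption: sniff_meta_charset and decode_body are hypothetical helpers (not part of requests), and sample stands in for the undecoded bytes you would get from text.content.

```python
import re

# Hypothetical helper, not part of requests: look for charset=... inside a
# <meta> tag near the start of the raw HTML bytes.
_META_CHARSET = re.compile(br'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.I)


def sniff_meta_charset(raw, default='utf-8'):
    """Return the charset declared in a <meta> tag, or `default` if none."""
    match = _META_CHARSET.search(raw[:2048])
    if match is None:
        return default
    return match.group(1).decode('ascii').lower()


def decode_body(raw):
    """Decode raw HTML bytes with the declared charset.

    Undecodable bytes are replaced rather than raising; an unknown charset
    name would still raise LookupError.
    """
    return raw.decode(sniff_meta_charset(raw), 'replace')


# Sample body standing in for the bytes in text.content.
sample = (u'<html><head><meta charset="utf-8"></head>'
          u'<body><p>\u05e9\u05e8\u05ea</p></body></html>').encode('utf-8')
decoded = decode_body(sample)
```

With requests, that would mean calling decode_body(text.content) instead of reading text.text. Setting text.encoding = text.apparent_encoding is another possibility, though it relies on statistical detection and may guess wrong on short pages.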
Ignacio Vazquez-Abrams answered Nov 15 '22 03:11