python requests downloading incorrect sound file from Google Translate

I'm using the script below to download the audio for the Chinese word 老師, but when I run it, I get a file different from the one served at that URL. I think this is an encoding issue, but since I've specified UTF-8, I'm not sure what's happening.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import requests

url = "http://translate.google.com/translate_tts?tl=zh-CN&q=老師"

r = requests.get(url)

with open('test.mp3', 'wb') as test:
    test.write(r.content)

UPDATE:

As per @abarnert's suggestion, I've checked that the file is UTF-8 with BOM and tested the code with 'idna'.

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import requests

url_1 = "http://translate.google.com/translate_tts?tl=zh-CN&q=老師"
url_2 = "http://translate.google.com/translate_tts?tl=zh-CN&q=\u8001\u5e2b"

r_1 = requests.get(url_1)
r_1_b = requests.get(url_1.encode('idna'))
r_2 = requests.get(url_2)
r_2_b = requests.get(url_2.encode('idna'))

# This downloads nonsense:
with open('r_1.mp3', 'wb') as test:
    test.write(r_1.content)

# This throws the error specified at bottom:
with open('r_1_b.mp3', 'wb') as test:
    test.write(r_1_b.content)

# This parses the characters individually, producing
# a file consisting of "u, eight, zero..." in Mandarin
with open('r_2.mp3', 'wb') as test:
    test.write(r_2.content)

# This produces a sound file consisting of "u, eight, zero, zero..." in Mandarin
with open('r_2_b.mp3', 'wb') as test:
    test.write(r_2_b.content)

The error I'm getting is:

Traceback (most recent call last):
  File "/home/MZ/Desktop/tts3.py", line 12, in <module>
    r_1_b = requests.get(url_1.encode('idna'))
  File "/usr/lib64/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/usr/lib64/python2.7/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/usr/lib64/python2.7/encodings/idna.py", line 21, in nameprep
    newlabel.append(stringprep.map_table_b2(c))
  File "/usr/lib64/python2.7/stringprep.py", line 197, in map_table_b2
    b = unicodedata.normalize("NFKC", al)
TypeError: must be unicode, not str
[Finished in 15.3s with exit code 1]
asked Nov 10 '22 14:11 by zadrozny


1 Answer

I've been able to reproduce your problem in Python 2 on both Linux and Windows (although the nonsense I get is different on each). But I can't reproduce it in Python 3, and I don't think you actually did either.

The short version is: you always want to use Unicode string literals if you want to include non-ASCII characters. On Python 2, that means a u prefix (on Python 3, the u prefix is meaningless but harmless):

url = u"http://translate.google.com/translate_tts?tl=zh-CN&q=老師"

And the safest thing to do (because then the wrong encoding in your text editor or your coding declaration can't affect anything) is:

url_2 = u"http://translate.google.com/translate_tts?tl=zh-CN&q=\u8001\u5e2b"

Without that, you're passing a bunch of UTF-8 bytes to requests without telling it that they're UTF-8.
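To make "a bunch of UTF-8 bytes" concrete, here is a minimal Python 3 sketch. On Python 3 the encoding step has to be written out, whereas a plain Python 2 literal in a UTF-8 source file already holds these bytes. The variable names are only for illustration.

```python
# The text as a real Unicode string: two code points.
text = u"\u8001\u5e2b"  # 老師

# What a plain (non-u) Python 2 literal in a UTF-8 source file holds:
# the six UTF-8 bytes, with no record of which encoding produced them.
raw = text.encode("utf-8")

print(len(text))  # 2
print(len(raw))   # 6
print(raw)        # b'\xe8\x80\x81\xe5\xb8\xab'
```

Anything that receives raw alone has to guess the encoding, which is where the platform differences come from.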

What I'd expect it to do in that case is look at sys.getdefaultencoding(), which will probably be 'ascii' at least on Mac and Linux, try to decode with that, and get an exception. On Windows, it might be 'cp1252' or 'big5' or whatever your system setting is, so it might send mojibake.

But it's not actually doing that. I'm not sure what it's doing, but it correctly guesses UTF-8 on Mac, does something bizarre that results in "eh eh eh" in three different tones on Linux (I think it's just interpreting the bytes as the equivalent codepoints, so 老 turns into U+00E8, U+0080, U+0081?), and does something different and bizarre on Windows that starts with the same first syllable but then has different ones.
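The "bytes as equivalent codepoints" guess can be reproduced by decoding the UTF-8 bytes as Latin-1. This sketch only illustrates that guess; I haven't confirmed it is what requests does internally on Linux.

```python
raw = u"\u8001".encode("utf-8")  # 老 as UTF-8 bytes: E8 80 81

# Latin-1 maps every byte to the code point with the same number,
# which is the "bytes as equivalent codepoints" reading guessed at above.
misread = raw.decode("latin-1")

print([hex(ord(ch)) for ch in misread])  # ['0xe8', '0x80', '0x81']
```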

For url_2, it's a little simpler: in 2.x non-Unicode string literals, \u8001 isn't considered an escape sequence; it's just the six characters backslash, u, 8, 0, 0, 1. requests will dutifully send those to Google, which will translate them and send back a recording of someone reading out those characters.
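You can see the same effect in Python 3 with a raw string literal, which I'm using here as a stand-in for the 2.x plain literal since both leave \u unprocessed:

```python
# A raw literal in Python 3 behaves like the Python 2 plain literal here:
# the backslash is kept as-is and no escape is processed.
unescaped = r"\u8001\u5e2b"
escaped = u"\u8001\u5e2b"

print(len(unescaped))        # 12
print(len(escaped))          # 2
print(list(unescaped[:6]))   # ['\\', 'u', '8', '0', '0', '1']
```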

If you add the u prefix, however, both of them work.

And in Python 3, with or without the u prefix, both of them work. (Interestingly, in 3.x, it works even with a b prefix… but apparently only because it just always assumes bytes are UTF-8 in 3.x; if I give it Big5 bytes, it mojibakes them as UTF-8 even if my sys.getdefaultencoding is right.)
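The Big5 case can be sketched without requests at all. This assumes the bytes really are treated as UTF-8, as described above, and uses errors="replace" only so the mismatch shows up as replacement characters instead of an exception:

```python
text = u"\u8001\u5e2b"  # 老師

big5_bytes = text.encode("big5")

# Decoding Big5 bytes as if they were UTF-8 cannot recover the text;
# errors="replace" substitutes U+FFFD instead of raising.
misdecoded = big5_bytes.decode("utf-8", errors="replace")

print(misdecoded == text)                  # False
print(big5_bytes.decode("big5") == text)   # True
```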

Also, manually query-string-encoding the query works, but that isn't necessary, and doesn't make any difference.
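For reference, manual query-string encoding in Python 3 looks roughly like this, using only the standard library. This sketch builds the URL but doesn't fetch it:

```python
from urllib.parse import quote, urlencode

text = u"\u8001\u5e2b"  # 老師

# quote() encodes to UTF-8 by default, then percent-escapes each byte.
print(quote(text))  # %E8%80%81%E5%B8%AB

# urlencode() builds the whole query string the same way.
query = urlencode({"tl": "zh-CN", "q": text})
url = "http://translate.google.com/translate_tts?" + query
print(url)
```

The percent-escaped form is plain ASCII, so it can't be misread no matter which encoding gets guessed.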

answered Nov 14 '22 21:11 by abarnert