I'm trying to pull an audio file from Google's text-to-speech service. Basically, you take the base URL and append whatever you want spoken to the end of it. The code below works fine for English, so I think the problem must be how the Chinese characters are encoded in the request. Here's what I've got:
String text = "text to be spoken";
public static final String AUDIO_CHINESE= "http://www.translate.google.com/translate_tts?tl=zh&q=";
public static final String AUDIO_ENGLISH = "http://www.translate.google.com/translate_tts?tl=en&q=";
URL url = new URL(AUDIO_ENGLISH + text);
urlConnection = (HttpURLConnection) url.openConnection();
urlConnection.setRequestMethod("GET");
urlConnection.setRequestProperty("Accept-Charset", Variables.UTF_8);
if (urlConnection.getResponseCode() == 200) {
    // get byte array in response
    in = new DataInputStream(urlConnection.getInputStream());
} else {
    in = new DataInputStream(urlConnection.getErrorStream());
}
//use commons io
byte[] bytes = IOUtils.toByteArray(in);
in.close();
urlConnection.disconnect();
return bytes;
When I try this with Chinese characters, though, it returns something that I can't get to play in MediaPlayer (I suspect it's not a proper audio file, as the vast majority of the bytes are '85'). So I've tried both
String chText = "你好";
URL url = new URL(AUDIO_CHINESE + URLEncoder.encode(chText, "UTF-8"));
and
URL url = new URL(AUDIO_CHINESE + Uri.encode(chText, "UTF-8"));
and then adding
urlConnection.setRequestProperty("content-type", "application/x-www-form-urlencoded; charset=UTF-8");
to the request header. This just made it worse, though, because now it doesn't even return a 200 code, instead stating "FileNotFound" in logcat.
So on a whim, I went back and tried the URL/Uri encoding with the English text, and now the English won't return a valid result either. Not sure what's going on here: the raw url in the debugger works fine if I copy and paste into Chrome, but for some reason the urlConnection just doesn't work. Feel like I'm missing something obvious.
EDIT
Fiddling with it some more has revealed no answer, just more confusion (and exasperation). For some reason, when sent over HttpURLConnection, the Google TTS service reads the UTF-8 percent-encoded text as UTF-16, at least as far as I can tell. For example, the character "維" (wei2) is %E7%B6%AD, but if you pass it through the connection, you'll get a file that pronounces "see" ("ç", to be precise). ç, as it turns out, is 0x00E7 in UTF-16 (its UTF-8 percent-encoded version is %C3%A7). I have no idea why it does that in Java, because putting the appropriate percent-encoded text at the end of the link in any browser works properly. Thus far, I have tried various ways of getting the TTS to read the entirety of %E7%B6%AD without much success.
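To make the byte-level observation concrete, here is a minimal Java sketch. The class and method names are made up for illustration, and the idea that the receiving side decodes the bytes with a single-byte charset such as ISO-8859-1 is only an assumption that happens to match the symptom; it is not confirmed behavior of the service. It shows that "維" percent-encodes to %E7%B6%AD in UTF-8, and that decoding those same bytes one at a time yields a string starting with ç (U+00E7).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
final class PercentEncodingDemo {

    // Percent-encode the text as UTF-8 bytes, as in the question.
    static String encodeUtf8(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Decode percent-encoded bytes with a single-byte charset.
    // Assumption: this mimics a receiver that does not treat the bytes as UTF-8.
    static String decodeAsLatin1(String encoded) {
        try {
            return URLDecoder.decode(encoded, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support ISO-8859-1.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = encodeUtf8("\u7DAD");      // the character 維
        System.out.println(encoded);                // prints %E7%B6%AD

        String misread = decodeAsLatin1(encoded);   // three separate characters
        System.out.println(Integer.toHexString(misread.charAt(0))); // prints e7
    }
}
```

The first byte, 0xE7, maps to ç in ISO-8859-1, which would explain why only a "see" sound comes back if the bytes are read individually.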
EDIT2
Solution to my problem found! See below for answer. The problem wasn't in the encoding, it was in the parsing on Google's end. Have edited the title accordingly. Cheers!
So, as it turns out, the problem in the end wasn't the encoding at all; it was the processing at Google's end. To get the service to correctly recognize UTF-8, you need to use the link http://www.translate.google.com/translate_tts?ie=utf-8&tl=zh-cn&q= instead of the one above. Note the ie=utf-8 added to the parameters. So you can just URLEncoder.encode("你好嗎", "UTF-8"), append it to the link, and send it up as per usual. Whew!
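Put together, building the request URL might look like the sketch below. The class name, method name, and parameter names are hypothetical; the base URL and the ie=utf-8 parameter are the ones given above, and the service may well have changed since, so treat this as an illustration of the encoding step rather than a guaranteed-working client.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
final class TtsUrlBuilder {

    // Base endpoint as quoted in the answer above.
    static final String TTS_BASE = "http://www.translate.google.com/translate_tts";

    // Builds the full request URL; ie=utf-8 tells the service how to read q.
    static String buildTtsUrl(String languageCode, String text) {
        try {
            return TTS_BASE
                    + "?ie=utf-8"
                    + "&tl=" + languageCode
                    + "&q=" + URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u4F60\u597D\u55CE" is 你好嗎
        System.out.println(buildTtsUrl("zh-cn", "\u4F60\u597D\u55CE"));
        // prints http://www.translate.google.com/translate_tts?ie=utf-8&tl=zh-cn&q=%E4%BD%A0%E5%A5%BD%E5%97%8E
    }
}
```

The resulting string can be passed to new URL(...) and fetched with the same HttpURLConnection code as in the question.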