Hey :) I'm trying really hard to make WebClient return UTF-8. But when sub should contain something like Ä, what I get looks more like E, or so I think.
I've tried a lot of workarounds, but nothing works.
private string translate(string input, string languagePair)
{
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
    WebClient wc = new WebClient();
    wc.Headers.Add(HttpRequestHeader.AcceptCharset, "UTF-8");
    wc.Encoding = Encoding.UTF8;
    var data = wc.DownloadData(url);
    var result = Encoding.UTF8.GetString(data);
    //string result = wc.DownloadString(url);
    int start = result.IndexOf("result_box");
    string sub = result.Substring(start);
    sub = sub.Substring(0, sub.IndexOf("</span>"));
    start = sub.LastIndexOf(">");
    sub = sub.Substring(start + 1);
    return sub;
}
Google simply ignores the encoding requested in the Accept-Charset header and returns the response in ISO-8859-1, as you can see from this shortened response:
HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
Content-Language: en
Content-Length: 64202
<!DOCTYPE html><html><head><meta content="text/html; charset=ISO-8859-1" http-equiv="content-type">
Therefore, when you decode the response as UTF-8, you get invalid characters. If you just want to make it work quickly, I have found that when a User-Agent header is added to the request, Google returns the response in UTF-8, and you can leave the rest of the code unmodified:
private static string translate(string input, string languagePair)
{
    // Escape the query values so spaces, '&' and non-ASCII text survive in the URL.
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}",
        Uri.EscapeDataString(input), Uri.EscapeDataString(languagePair));
    using (WebClient wc = new WebClient())
    {
        wc.Headers.Add(HttpRequestHeader.AcceptCharset, "utf-8");
        // With a browser-like User-Agent, Google answers in UTF-8.
        wc.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/55.0");
        wc.Encoding = Encoding.UTF8;
        string result = wc.DownloadString(url);
        // Take the text between the last '>' and the first "</span>" after "result_box".
        int start = result.IndexOf("result_box");
        string sub = result.Substring(start);
        sub = sub.Substring(0, sub.IndexOf("</span>"));
        start = sub.LastIndexOf(">");
        sub = sub.Substring(start + 1);
        return sub;
    }
}
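To see why the original code produced a broken character, here is a minimal sketch that reproduces the mismatch without any network call. The class and method names are made up for illustration; only the `Encoding` calls are real framework APIs.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes the text as ISO-8859-1 (what the server sent), then decodes
    // the same bytes as UTF-8 (what the question's code assumed).
    public static string DecodeLatin1BytesAsUtf8(string text)
    {
        byte[] latin1Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(text);
        return Encoding.UTF8.GetString(latin1Bytes);
    }
}
```

For "Ä" the ISO-8859-1 byte is 0xC4, which is not a valid UTF-8 sequence on its own, so `MojibakeDemo.DecodeLatin1BytesAsUtf8("Ä")` returns the replacement character U+FFFD. Plain ASCII text passes through unchanged, which is why only the umlauts look wrong.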
A better solution is to detect the encoding used in the response and use it for decoding. WebClient does not have this detection built in, so you can either use the solution described here or use HttpClient instead, which does this for you automatically:
private static async Task<string> translate(string input, string languagePair)
{
    // Escape the query values so spaces, '&' and non-ASCII text survive in the URL.
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}",
        Uri.EscapeDataString(input), Uri.EscapeDataString(languagePair));
    using (var hc = new HttpClient())
    {
        // GetStringAsync decodes the body using the charset from the Content-Type header.
        var result = await hc.GetStringAsync(url).ConfigureAwait(false);
        int start = result.IndexOf("result_box");
        string sub = result.Substring(start);
        sub = sub.Substring(0, sub.IndexOf("</span>"));
        start = sub.LastIndexOf(">");
        sub = sub.Substring(start + 1);
        return sub;
    }
}
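If you would rather stay with WebClient and the raw bytes from DownloadData, the detection could be sketched roughly like this. This is only one possible approach, not necessarily the one the linked solution uses: it reads the charset from the Content-Type response header and falls back to UTF-8. The class and method names are hypothetical, and it does not look at the `<meta>` tag inside the HTML.

```csharp
using System;
using System.Net;
using System.Net.Mime;
using System.Text;

static class CharsetAwareDownload
{
    // Picks an Encoding from a Content-Type header value such as
    // "text/html; charset=ISO-8859-1". Returns the fallback when the header
    // is missing, malformed, has no charset, or names an unknown charset.
    public static Encoding EncodingFromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentType)) return fallback;
        try
        {
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset) ? fallback : Encoding.GetEncoding(charset);
        }
        catch (FormatException) { return fallback; }   // malformed header value
        catch (ArgumentException) { return fallback; } // unknown charset name
    }

    // Downloads the raw bytes and decodes them with the charset the server declared.
    public static string DownloadString(string url)
    {
        using (var wc = new WebClient())
        {
            byte[] data = wc.DownloadData(url);
            string contentType = wc.ResponseHeaders[HttpResponseHeader.ContentType];
            return EncodingFromContentType(contentType, Encoding.UTF8).GetString(data);
        }
    }
}
```

In the translate method you would then replace the DownloadData and Encoding.UTF8.GetString lines with a single call to `CharsetAwareDownload.DownloadString(url)`.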
Also, please note that Google has a Translation API, which might be a better choice than parsing the translation out of an HTML page.