
C# WebClient doesn't return UTF-8

Hey :) I'm trying really hard to make WebClient return UTF-8. But when sub should contain something like Ä, I get something that looks more like an E instead, I think.

I've tried a lot of workarounds, but none of them work.

private string translate(string input, string languagePair)
{
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
    WebClient wc = new WebClient();
    wc.Headers.Add(HttpRequestHeader.AcceptCharset, "UTF-8");
    wc.Encoding = Encoding.UTF8;
    var data = wc.DownloadData(url);
    var result = Encoding.UTF8.GetString(data);
    //string result = wc.DownloadString(url);
    int start = result.IndexOf("result_box");
    string sub = result.Substring(start);
    sub = sub.Substring(0, sub.IndexOf("</span>"));
    start = sub.LastIndexOf(">");
    sub = sub.Substring(start + 1);
    return sub;
}
koin asked Mar 09 '23 01:03


1 Answer

Google simply ignores the encoding requested in the Accept-Charset header and returns the response in ISO-8859-1, as you can see from this shortened response:

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
Content-Language: en
Content-Length: 64202

<!DOCTYPE html><html><head><meta content="text/html; charset=ISO-8859-1" http-equiv="content-type">
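
The effect of that charset mismatch can be reproduced without any network call. The following is only a minimal sketch: the class name CharsetMismatchDemo and the helper RoundTrip are made up for illustration and are not part of the original code. It encodes Ä as ISO-8859-1 (the single byte 0xC4) and then decodes that byte once as UTF-8 and once as ISO-8859-1:

```csharp
using System;
using System.Text;

static class CharsetMismatchDemo
{
    // Hypothetical helper: encodes text as ISO-8859-1 bytes (as the server
    // does), then decodes those bytes with the given encoding (as the client does).
    public static string RoundTrip(string text, Encoding decoder)
    {
        byte[] latin1Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(text);
        return decoder.GetString(latin1Bytes);
    }

    static void Main()
    {
        string original = "\u00C4"; // Ä, the single byte 0xC4 in ISO-8859-1

        // 0xC4 on its own is an incomplete UTF-8 sequence, so the UTF-8
        // decoder substitutes the replacement character U+FFFD.
        string wrong = RoundTrip(original, Encoding.UTF8);

        // Decoding with the charset the server actually used restores the text.
        string right = RoundTrip(original, Encoding.GetEncoding("ISO-8859-1"));

        Console.WriteLine(wrong == "\uFFFD"); // True
        Console.WriteLine(right == original); // True
    }
}
```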

So when you decode the response as UTF-8, you get invalid characters. If you just want to make it work quickly: I found that when a User-Agent header is added to the request, Google returns the response in UTF-8, and you can leave the rest of the code unmodified:

private static string translate(string input, string languagePair)
{
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
    WebClient wc = new WebClient();
    wc.Headers.Add(HttpRequestHeader.AcceptCharset, "utf-8");
    // With a browser-like User-Agent, Google was observed to answer in UTF-8.
    wc.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/55.0");
    wc.Encoding = Encoding.UTF8;
    string result = wc.DownloadString(url);
    int start = result.IndexOf("result_box");
    string sub = result.Substring(start);
    sub = sub.Substring(0, sub.IndexOf("</span>"));
    start = sub.LastIndexOf(">");
    sub = sub.Substring(start + 1);
    return sub;
}

A better solution is to detect the encoding used in the response and decode with that. WebClient does not have this detection built in, so you can either use the solution described here or use HttpClient instead, which does this for you automatically:

private static async Task<string> translate(string input, string languagePair)
{
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
    using (var hc = new HttpClient())
    {
        var result = await hc.GetStringAsync(url).ConfigureAwait(false);
        int start = result.IndexOf("result_box");
        string sub = result.Substring(start);
        sub = sub.Substring(0, sub.IndexOf("</span>"));
        start = sub.LastIndexOf(">");
        sub = sub.Substring(start + 1);
        return sub;
    }
}
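
If you would rather stay with WebClient, detecting the encoding by hand could look roughly like the sketch below. It is only an illustration, not necessarily the linked solution: the class ResponseDecoder and its Decode method are hypothetical names, and falling back to UTF-8 when no usable charset is present is an assumption. It reads the charset parameter from the Content-Type header value and decodes the raw bytes with it:

```csharp
using System;
using System.Net.Mime;
using System.Text;

static class ResponseDecoder
{
    // Hypothetical helper: decodes a response body using the charset named in
    // the Content-Type header value, e.g. "text/html; charset=ISO-8859-1".
    public static string Decode(byte[] body, string contentTypeHeader)
    {
        Encoding encoding = Encoding.UTF8; // assumption: UTF-8 when nothing better is known

        if (!string.IsNullOrEmpty(contentTypeHeader))
        {
            try
            {
                string charset = new ContentType(contentTypeHeader).CharSet;
                if (!string.IsNullOrEmpty(charset))
                {
                    encoding = Encoding.GetEncoding(charset);
                }
            }
            catch (FormatException)
            {
                // Header could not be parsed: keep the fallback encoding.
            }
            catch (ArgumentException)
            {
                // Charset name is not known to this runtime: keep the fallback.
            }
        }

        return encoding.GetString(body);
    }

    static void Main()
    {
        byte[] body = { 0xC4 }; // Ä in ISO-8859-1
        string text = Decode(body, "text/html; charset=ISO-8859-1");
        Console.WriteLine(text == "\u00C4"); // True
    }
}
```

With WebClient you would pass it the bytes from wc.DownloadData(url) together with wc.ResponseHeaders[HttpResponseHeader.ContentType]. Note that this only looks at the HTTP header, not at a meta tag inside the HTML.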

Also note that Google has a Translation API, which is probably a better option than parsing the translation out of an HTML page.

Ňuf answered Mar 21 '23 00:03