
Encoding "ä" into "%E4"

Tags: c#, encoding

I'm trying to work out the best way to encode text in C# so that it fulfills a requirement of a new SMS provider.

The text I want to send is:

Bäste Björn

The encoded text that the provider says it needs is:

B%E4ste+Bj%F6rn

so ä is %E4 and ö is %F6


From this answer, I gathered that for such a conversion I need to use HttpUtility.HtmlAttributeEncode, because the normal HttpUtility.UrlEncode outputs:

B%c3%a4ste+Bj%c3%b6rn

and that shows up as weird characters on the mobile phone :/

As several characters are not converted, I tried this:

private string specialEncoding(string text)
{
    StringBuilder r = new StringBuilder();
    foreach (char c in text.ToCharArray())
    {
        string e = System.Web.HttpUtility.UrlEncode(c.ToString());
        if (e.StartsWith("%") && e.ToLower() != "%0a") // %0a == Linefeed
        {
            string attr = System.Web.HttpUtility.HtmlAttributeEncode(c.ToString());
            r.Append(attr);
        }
        else
        {
            r.Append(e);
        }

    }
    return r.ToString();
}

It is verbose so that I could set a breakpoint and test each character, and I found out that:

System.Web.HttpUtility.HtmlAttributeEncode("ä") is actually equal to ä... so there is no %E4 as output...

What am I missing? And is there a simple way to do the encoding, without manipulating the text char by char, that produces the required output?

balexandre asked Mar 26 '14 10:03



1 Answer

that the provider say it needs

Ask the provider which era they are living in. According to the Wikipedia article on percent-encoding:

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

Granted, this RFC talks about "new URI schemes", which HTTP obviously is not, but adhering to this standard prevents headaches like this. See also What is the proper way to URL encode Unicode characters?.

They seem to want you to encode characters according to the Windows-1250 code page (or a comparable one, such as ISO-8859-1 or -2; check alternatives here) instead, since in those code pages the byte E4 (decimal 228) maps to ä and F6 (decimal 246) maps to ö. As @Simon points out in his comment, you should ask the provider exactly which code page they want you to use.
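To make the difference concrete, here is a minimal sketch that prints the bytes behind a character in two encodings. The class and method names (EncodingBytesDemo, HexBytes) are hypothetical, and ISO-8859-1 is used only as a stand-in that agrees with Windows-1250 for ä and ö; on .NET Core and .NET 5+, Windows-1250 itself additionally needs CodePagesEncodingProvider to be registered.

```csharp
using System;
using System.Text;

static class EncodingBytesDemo
{
    // Returns the bytes of `text` in the given encoding as uppercase hex,
    // separated by hyphens (the format BitConverter.ToString produces).
    public static string HexBytes(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

With this helper, HexBytes("\u00E4", Encoding.UTF8) gives C3-A4, which is where the unwanted %c3%a4 comes from, while HexBytes("\u00E4", Encoding.GetEncoding("ISO-8859-1")) gives E4, the single byte the provider expects.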

Assuming Windows-1250, you can implement it like this, according to URL encode ASCII/UTF16 characters:

// Requires: using System.Text; using System.Web;
var windows1250 = Encoding.GetEncoding(1250); // Windows-1250 code page
var percentEncoded = HttpUtility.UrlEncode("Bäste Björn", windows1250);

The value of percentEncoded is:

B%e4ste+Bj%f6rn

If they insist on using uppercase, see .net UrlEncode - lowercase problem.
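If uppercase hex is required and you would rather not post-process the UrlEncode result, one option is to percent-encode the bytes yourself. The following is only a sketch under assumptions: LegacyUrlEncoder is a hypothetical name, the set of characters left unescaped (A-Z, a-z, 0-9, and - _ . ~) is stricter than what HttpUtility.UrlEncode leaves alone, and the code is only meant for single-byte code pages whose lower half matches ASCII.

```csharp
using System;
using System.Text;

static class LegacyUrlEncoder
{
    // Percent-encodes `text` byte by byte in the given single-byte encoding,
    // writing uppercase hex digits and '+' for spaces (form-style encoding).
    public static string Encode(string text, Encoding encoding)
    {
        var result = new StringBuilder();
        foreach (byte b in encoding.GetBytes(text))
        {
            char c = (char)b;
            bool unreserved =
                (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                (c >= '0' && c <= '9') ||
                c == '-' || c == '_' || c == '.' || c == '~';

            if (unreserved)
                result.Append(c);                            // safe ASCII, copy as is
            else if (c == ' ')
                result.Append('+');                          // space becomes '+'
            else
                result.Append('%').Append(b.ToString("X2")); // e.g. 0xE4 -> %E4
        }
        return result.ToString();
    }
}
```

Calling LegacyUrlEncoder.Encode("Bäste Björn", Encoding.GetEncoding("ISO-8859-1")) yields B%E4ste+Bj%F6rn. Characters the chosen code page cannot represent are handled by the encoding's fallback (typically replaced, for example with ?), so the input should be checked if that matters for the provider.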

CodeCaster answered Oct 06 '22 18:10
