I need to write a server-side function to sanitize URL-encoded strings.
Example querystring:
FirstName=John&LastName=B%F3th&Address=San+Endre+%FAt+12%2F14
When I pass that through HttpUtility.UrlDecode()
I get:
FirstName=John&LastName=B�th&Address=San Endre �t 12/14
The function from this SO post looks perfect, but it expects decoded strings that already contain the accented characters:
RemoveDiacritics('Bóth') ==> 'Both';
RemoveDiacritics('San Endre út 12/14') ==> 'San Endre ut 12/14';
How can I decode the URL without getting all these � characters?
I cannot do anything client side or change the way they come into my function.
I agree with the arguments already put forth; however, if you're always receiving your encoded strings from the same client, then you can match their encoding. In this case, they appear to be using ISO/IEC 8859-1, informally known as Latin-1, which is one of the most popular 8-bit character sets in use. You can decode ISO/IEC 8859-1 using the following code (which will correctly decode the sample string you provided):
HttpUtility.UrlDecode(encodedInput, Encoding.GetEncoding("iso-8859-1"));
MSDN guarantees that the above code page will be natively supported by the .NET Framework, regardless of the underlying platform; refer to the table of supported encodings for the Encoding Class.
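To connect this back to the question, here is a minimal sketch that decodes with Latin-1 and then strips the accents. It assumes the client always percent-encodes ISO-8859-1 bytes. The class name QueryStringCleaner and the method DecodeAndStrip are made up for illustration, and RemoveDiacritics is only a stand-in for the helper from the linked SO post, written with the usual normalize-and-filter approach.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;
using System.Web;

// Hypothetical helper class; the names are illustrative, not from the question.
public static class QueryStringCleaner
{
    // Assumption: the client always percent-encodes ISO-8859-1 (Latin-1) bytes.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Stand-in for the RemoveDiacritics helper mentioned in the question:
    // decompose each character (FormD), drop the combining marks, recompose.
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var kept = decomposed.Where(c =>
            CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
        return new string(kept.ToArray()).Normalize(NormalizationForm.FormC);
    }

    // Decode the percent-escapes as Latin-1 bytes, then strip the accents.
    public static string DecodeAndStrip(string encoded)
    {
        string decoded = HttpUtility.UrlDecode(encoded, Latin1);
        return RemoveDiacritics(decoded);
    }
}
```

Calling QueryStringCleaner.DecodeAndStrip on the sample querystring should give FirstName=John&LastName=Both&Address=San Endre ut 12/14. If the values themselves may contain & or =, it is probably safer to split the querystring into pairs first and decode each value separately.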
UrlDecode expects UTF-8 for its input by default, where each character above \u007F is encoded as at least 2 bytes. So the correct string (if the character is \u00F3, ó) would have contained %C3%B3, not %F3.
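The difference is easy to see by printing the bytes each encoding produces for the same character. This is only an illustrative sketch; EncodingBytes and PercentEncodeAll are hypothetical names, and the method escapes every byte (including ASCII), which a real URL encoder would not do.

```csharp
using System.Text;

// Hypothetical helper for illustration only.
public static class EncodingBytes
{
    // Writes every byte of the encoded text as a %XX escape,
    // so the byte-level difference between encodings is visible.
    public static string PercentEncodeAll(string text, Encoding encoding)
    {
        var sb = new StringBuilder();
        foreach (byte b in encoding.GetBytes(text))
        {
            sb.Append('%').Append(b.ToString("X2"));
        }
        return sb.ToString();
    }
}
```

EncodingBytes.PercentEncodeAll("\u00F3", Encoding.UTF8) returns %C3%B3, while the same call with Encoding.GetEncoding("iso-8859-1") returns %F3, which matches what the client is sending.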
If the strings arrive the way you get them, I'm not sure there's much you can do. Not with the standard libraries, that is.
By the way, stripping accents from foreign characters is OK, but I wouldn't call it "sanitizing".
%F3 and %FA are not valid UTF-8 or ASCII byte sequences. It looks like the client-side code encodes the string in the current page's locale.
Depending on your needs, you can either simply strip out all characters above 127, or figure out how to properly decode the incoming URL (I don't think a built-in function exists to handle it as is).
I would copy the characters into a byte array (including the manually decoded %-encoded ones) and use the correct Encoding to convert it to a string (using Encoding.GetString - http://msdn.microsoft.com/en-us/library/system.text.encoding.getstring.aspx).
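A rough sketch of that byte-array approach might look like the following. ManualUrlDecoder and Decode are hypothetical names, and the code assumes the input is a well-formed URL-encoded string (ASCII only, with + meaning a space); the caller picks the encoding, for example Encoding.GetEncoding("iso-8859-1") for the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only.
public static class ManualUrlDecoder
{
    public static string Decode(string input, Encoding encoding)
    {
        var bytes = new List<byte>(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '%' && i + 2 < input.Length
                && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
            {
                // %XX escape: store the raw byte value and skip the two hex digits.
                bytes.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
                i += 2;
            }
            else if (c == '+')
            {
                // In form-encoded data a plus sign stands for a space.
                bytes.Add((byte)' ');
            }
            else
            {
                // Assumption: unescaped characters are plain ASCII;
                // anything else is replaced with '?'.
                bytes.Add(c <= 0x7F ? (byte)c : (byte)'?');
            }
        }
        // Interpret the collected bytes using the caller-supplied encoding.
        return encoding.GetString(bytes.ToArray());
    }
}
```

With the Latin-1 encoding, ManualUrlDecoder.Decode("B%F3th", latin1) yields Bóth, which can then be passed on to RemoveDiacritics.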