
Sanitize Foreign Characters / Accents From URL

I need to write a server-side function to sanitize URL-encoded strings.

Example querystring:

FirstName=John&LastName=B%F3th&Address=San+Endre+%FAt+12%2F14

When I pass that through HttpUtility.UrlDecode() I get:

FirstName=John&LastName=B�th&Address=San Endre �t 12/14

The function from this SO post looks perfect, but it expects decoded strings that already contain the accented characters:

RemoveDiacritics('Bóth') ==> 'Both';
RemoveDiacritics('San Endre út 12/14') ==> 'San Endre ut 12/14';

How can I decode the URL without getting these replacement characters?

I cannot do anything client-side or change the way the strings come into my function.

Greg asked Jan 20 '12 20:01


3 Answers

I agree with the arguments already put forth; however, if you're always receiving your encoded strings from the same client, then you can match their encoding. In this case, they appear to be using ISO/IEC 8859-1, informally known as Latin-1, which is one of the most popular 8-bit character sets in use. You can decode ISO/IEC 8859-1 using the following code (which will correctly decode the sample string you provided):

HttpUtility.UrlDecode(encodedInput, Encoding.GetEncoding("iso-8859-1"));

MSDN guarantees that the above code page will be natively supported by the .NET Framework, regardless of the underlying platform; refer to the table of supported encodings for the Encoding Class.
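To tie this back to the question, here is a minimal sketch that decodes one value as Latin-1 and then strips the accents. The class name QueryStringCleaner and the method DecodeAndStrip are made up for illustration, and RemoveDiacritics here is only one common way to write such a helper (Unicode decomposition, then dropping combining marks), not necessarily the function from the linked SO post.

```csharp
using System.Globalization;
using System.Text;
using System.Web;

// Hypothetical helper class; the names are illustrative, not from the question.
static class QueryStringCleaner
{
    // Assumed implementation: decompose to FormD, drop combining marks, recompose.
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Decodes a single URL-encoded value as ISO-8859-1, then strips accents.
    public static string DecodeAndStrip(string encodedValue)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string decoded = HttpUtility.UrlDecode(encodedValue, latin1);
        return RemoveDiacritics(decoded);
    }
}
```

With the sample data, QueryStringCleaner.DecodeAndStrip("San+Endre+%FAt+12%2F14") should give "San Endre ut 12/14". It is probably safer to split the query string on & and = first and decode each value separately, so that an encoded %26 or %3D inside a value cannot be mistaken for a separator.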

Douglas answered Oct 01 '22 06:10


By default, UrlDecode expects UTF-8 for its input, where each character above \u007F is encoded as at least 2 bytes. So the correct string (if the character is \u00F3, ó) would have contained %C3%B3, not %F3.
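A quick way to see the difference is to print the bytes each encoding produces for ó. This is only an illustrative console snippet; the class name EncodingBytesDemo is made up.

```csharp
using System;
using System.Text;

class EncodingBytesDemo
{
    static void Main()
    {
        string s = "\u00F3"; // ó

        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        byte[] latin1 = Encoding.GetEncoding("iso-8859-1").GetBytes(s);

        Console.WriteLine(BitConverter.ToString(utf8));   // C3-B3
        Console.WriteLine(BitConverter.ToString(latin1)); // F3
    }
}
```

The single byte F3 matches the %F3 in the question's query string, which is why decoding it as UTF-8 produces a replacement character instead of ó.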

If the strings arrive the way you get them, I'm not sure there's much you can do. Not with the standard libraries, that is.

By the way, stripping accents from foreign characters is OK, but I wouldn't call it "sanitizing".

Mr Lister answered Oct 01 '22 05:10


%F3 and %FA are valid in neither UTF-8 nor ASCII. It looks like the client-side code encodes the string using the current page's encoding.

Depending on your needs, you can either simply strip out all characters above 127 or figure out how to properly decode the incoming URL (I don't think a built-in function exists to handle it as is).

I would copy the characters into a byte array (including the manually decoded %-encoded ones) and use the correct Encoding to convert it to a string (using Encoding.GetString - http://msdn.microsoft.com/en-us/library/system.text.encoding.getstring.aspx).
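A rough sketch of that manual approach might look like the following. ManualUrlDecoder and Decode are hypothetical names, and the code assumes the caller already knows which encoding the client used (for the sample data, most likely ISO-8859-1).

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper; a sketch of the byte-array approach, not a library API.
static class ManualUrlDecoder
{
    public static string Decode(string input, Encoding encoding)
    {
        var bytes = new List<byte>(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            byte b;
            if (c == '%' && i + 2 < input.Length &&
                byte.TryParse(input.Substring(i + 1, 2),
                              NumberStyles.AllowHexSpecifier,
                              CultureInfo.InvariantCulture, out b))
            {
                bytes.Add(b); // %XX becomes one raw byte
                i += 2;       // skip the two hex digits
            }
            else if (c == '+')
            {
                bytes.Add((byte)' '); // form encoding uses + for a space
            }
            else
            {
                // Unescaped characters are assumed to be plain ASCII in practice;
                // anything else is encoded with the same target encoding.
                bytes.AddRange(encoding.GetBytes(new[] { c }));
            }
        }
        return encoding.GetString(bytes.ToArray());
    }
}
```

For example, ManualUrlDecoder.Decode("B%F3th", Encoding.GetEncoding("iso-8859-1")) should return "Bóth". A malformed escape such as a trailing % is passed through unchanged here, which is a design choice of this sketch and not required behaviour.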

Alexei Levenkov answered Oct 01 '22 05:10