
HTML/URL decoding a string that was encoded multiple times

Tags:

c#

We have a string that is read from a web page. Because browsers tolerate unencoded special characters (e.g. an ampersand), some pages encode them and some do not, so it is quite likely that some of our stored data was encoded once and some multiple times.

Is there a clean way to make sure my string is fully decoded, no matter how many times it was encoded?

Here is what we are using now:

public static string HtmlDecode(this string input)
{
     var temp = HttpUtility.HtmlDecode(input);
     while (temp != input)
     {
         input = temp;
         temp = HttpUtility.HtmlDecode(input);
     }
     return input;
}

We use the same approach with UrlDecode.
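
For illustration, a minimal sketch of what the UrlDecode counterpart might look like, assuming it mirrors the HtmlDecode loop above. The class name StringDecodeExtensions and the method name UrlDecodeFully are hypothetical, chosen only for this example; HttpUtility.UrlDecode comes from System.Web.

```csharp
using System.Web;

public static class StringDecodeExtensions
{
    // Hypothetical helper: repeats UrlDecode until the string stops changing.
    public static string UrlDecodeFully(this string input)
    {
        var temp = HttpUtility.UrlDecode(input);
        while (temp != input)
        {
            // Another layer of encoding was removed; try once more.
            input = temp;
            temp = HttpUtility.UrlDecode(input);
        }
        return input;
    }
}
```

With this sketch, "a%2526b".UrlDecodeFully() goes through "a%26b" and ends at "a&b". Note that UrlDecode also turns "+" into a space, so a literal plus sign in the data would be lost.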

sasjaq asked Apr 02 '14 20:04



3 Answers

That's probably the best approach, honestly. The real solution would be to rework your code so that everything is encoded exactly once in all places, and therefore only needs to be decoded once.

Haney answered Oct 26 '22 15:10


Your code seems to decode HTML strings correctly, repeating the check until the result stops changing.

However, if the input HTML is malformed, i.e. not encoded properly, the result may be unexpected: bad input might not be decoded properly no matter how many times it passes through this method.

A quick check with two strings, one completely encoded and one only partially encoded, yielded the following results:

"&lt;b&gt;" will decode to "<b>"

"&lt;b&gt" will decode to "<b&gt"
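
The check above can be reproduced with a small console program. This is only a sketch: it assumes WebUtility.HtmlDecode from System.Net (HttpUtility.HtmlDecode from the question should behave the same way for these inputs), and the class name DecodeCheck is made up for the example.

```csharp
using System;
using System.Net;

public static class DecodeCheck
{
    public static void Main()
    {
        // Both entities are terminated with ';', so both are decoded.
        Console.WriteLine(WebUtility.HtmlDecode("&lt;b&gt;")); // <b>

        // The second entity has no ';', so it is left untouched.
        Console.WriteLine(WebUtility.HtmlDecode("&lt;b&gt"));  // <b&gt
    }
}
```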

like image 26
LakshmiNarayanan Avatar answered Oct 26 '22 14:10

LakshmiNarayanan


In case this is helpful to anyone, here is a recursive version for strings that were HTML-encoded multiple times (I find it a bit easier to read):

public static string HtmlDecode(string input) {
    string decodedInput = WebUtility.HtmlDecode(input);

    if (input == decodedInput) {
        return input;
    }

    return HtmlDecode(decodedInput);
}

WebUtility is in the System.Net namespace.
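
If recursion depth or runaway input is a concern, the same idea can be written as a loop with an upper bound on the number of passes. This is a sketch under assumptions, not a replacement for the version above: the class name HtmlDecodeHelper, the method name HtmlDecodeFully, and the maxPasses parameter with its default of 10 are all hypothetical.

```csharp
using System.Net;

public static class HtmlDecodeHelper
{
    // maxPasses is a hypothetical safety limit; it is not part of WebUtility.
    public static string HtmlDecodeFully(string input, int maxPasses = 10)
    {
        for (int pass = 0; pass < maxPasses; pass++)
        {
            string decoded = WebUtility.HtmlDecode(input);
            if (decoded == input)
            {
                // Nothing changed, so the string is fully decoded.
                return input;
            }
            input = decoded;
        }
        // Limit reached: return whatever has been decoded so far.
        return input;
    }
}
```

For example, HtmlDecodeHelper.HtmlDecodeFully("&amp;amp;lt;b&amp;amp;gt;") passes through "&amp;lt;b&amp;gt;" and "&lt;b&gt;" before settling on "<b>".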

Dimitar Dimitrov answered Oct 26 '22 15:10