
Finding a parsing API for partially UTF-8 encoded URLs

Tags:

c#

.net

uri

When parsing HTML for certain web pages (most notably, any Windows Live page) I encounter a lot of URLs in the following format.

http\x3a\x2f\x2fjs.wlxrs.com\x2fjt6xQREgnzkhGufPqwcJjg\x2fempty.htm

These appear to be partially UTF-8 escaped strings (\x2f = /, \x3a = :, etc.). Is there a .NET API that can be used to transform these strings into a System.Uri? It seems easy enough to parse, but I'm trying to avoid building a new wheel today.

JaredPar asked Dec 11 '08 16:12


1 Answer

What you posted is not a valid HTTP URL: the \xNN sequences are backslash hex escapes, not the %NN percent-encoding that URLs use. As such, of course HttpUtility.UrlDecode() won't work. But irrespective of that, you can turn this back into normal text like this:

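// Requires: using System.Globalization; using System.Text.RegularExpressions;
string input = @"http\x3a\x2f\x2fjs.wlxrs.com\x2fjt6xQREgnzkhGufPqwcJjg\x2fempty.htm";
// Replace each \xNN escape (two hex digits, either case) with the character whose code is NN.
string output = Regex.Replace(input, @"\\x([0-9a-fA-F]{2})",
    m => ((char) int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());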

But notice that this assumes that the encoding is Latin-1 rather than UTF-8. The input you provided is inconclusive in that respect, because every escaped byte in it (\x3a, \x2f) is plain ASCII, which is identical in both encodings. If you need UTF-8 to work, you need a slightly longer route: convert the string to bytes, replacing the escape sequences with the relevant bytes in the process (this probably needs a while loop), and then use Encoding.UTF8.GetString() on the resulting byte array.
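
The UTF-8 route described above could look roughly like the following. This is only a sketch under the assumption that each \xNN escape stands for one raw UTF-8 byte; the class name HexEscapeDecoder and the method name DecodeUtf8 are made up for illustration and are not part of .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper: the name and shape are illustrative, not a .NET API.
public static class HexEscapeDecoder
{
    // Assumes every \xNN escape represents one raw byte of UTF-8 data.
    public static string DecodeUtf8(string input)
    {
        var bytes = new List<byte>(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            // A well-formed \xNN escape contributes exactly one byte.
            if (i + 3 < input.Length && input[i] == '\\' && input[i + 1] == 'x'
                && Uri.IsHexDigit(input[i + 2]) && Uri.IsHexDigit(input[i + 3]))
            {
                bytes.Add(byte.Parse(input.Substring(i + 2, 2), NumberStyles.HexNumber));
                i += 4;
            }
            else
            {
                // Everything else is re-encoded as UTF-8; surrogate pairs stay together.
                int len = char.IsHighSurrogate(input[i]) && i + 1 < input.Length
                          && char.IsLowSurrogate(input[i + 1]) ? 2 : 1;
                bytes.AddRange(Encoding.UTF8.GetBytes(input.Substring(i, len)));
                i += len;
            }
        }
        // Interpret the collected bytes as one UTF-8 sequence.
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}
```

With this sketch, HexEscapeDecoder.DecodeUtf8(@"http\x3a\x2f\x2fjs.wlxrs.com\x2fjt6xQREgnzkhGufPqwcJjg\x2fempty.htm") yields http://js.wlxrs.com/jt6xQREgnzkhGufPqwcJjg/empty.htm, which can then be passed to new Uri(...) or Uri.TryCreate(...) to get the System.Uri the question asks for. Unlike the regex version, a multi-byte sequence such as \xc3\xa9 comes out as a single é rather than two Latin-1 characters.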

Timwi answered Oct 05 '22 23:10