When parsing HTML for certain web pages (most notably, any Windows Live page) I encounter a lot of URLs in the following format.
http\x3a\x2f\x2fjs.wlxrs.com\x2fjt6xQREgnzkhGufPqwcJjg\x2fempty.htm
These appear to be partially UTF-8 escaped strings (\x2f = /, \x3a = :, etc.). Is there a .NET API that can be used to transform these strings into a System.Uri? It seems easy enough to parse, but I'm trying to avoid building a new wheel today.
What you posted is not URL-encoded text (URL encoding would use %2f, not \x2f), so of course HttpUtility.UrlDecode() won't work on it. But irrespective of that, you can turn this back into normal text like this:
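// Requires: using System.Globalization; using System.Text.RegularExpressions;
string input = @"http\x3a\x2f\x2fjs.wlxrs.com\x2fjt6xQREgnzkhGufPqwcJjg\x2fempty.htm";
// Replace each \xNN escape with the character whose code point is 0xNN.
string output = Regex.Replace(input, @"\\x([0-9a-fA-F]{2})",
    m => ((char) int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());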
But notice that this assumes the escaped values are Latin-1 code points rather than UTF-8 bytes. The input you provided is inconclusive in that respect, since every escape in it is plain ASCII. If you need UTF-8 to work, you need a slightly longer route: convert the string to bytes, replacing each escape sequence with the corresponding byte as you go (this probably needs a while loop), and then call Encoding.UTF8.GetString() on the resulting byte array.
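The UTF-8 route described above could look roughly like the following. This is only a sketch: the class and method names (HexEscapeDecoder, DecodeUtf8) are made up for illustration and are not part of the framework, and it assumes every \xNN escape represents one raw UTF-8 byte.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class HexEscapeDecoder
{
    // Hypothetical helper (not a framework API): treats each \xNN escape as a
    // raw byte, then decodes the complete byte sequence as UTF-8.
    public static string DecodeUtf8(string input)
    {
        var bytes = new List<byte>();
        int i = 0;
        while (i < input.Length)
        {
            bool isEscape = i + 3 < input.Length
                && input[i] == '\\' && input[i + 1] == 'x'
                && Uri.IsHexDigit(input[i + 2]) && Uri.IsHexDigit(input[i + 3]);

            if (isEscape)
            {
                // Escape sequence: emit the single byte it encodes.
                bytes.Add(byte.Parse(input.Substring(i + 2, 2),
                    NumberStyles.HexNumber, CultureInfo.InvariantCulture));
                i += 4;
            }
            else
            {
                // Literal text: emit its UTF-8 bytes, keeping surrogate pairs together.
                int len = char.IsSurrogatePair(input, i) ? 2 : 1;
                bytes.AddRange(Encoding.UTF8.GetBytes(input.Substring(i, len)));
                i += len;
            }
        }
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}
```

With that in place, new Uri(HexEscapeDecoder.DecodeUtf8(input)) should give you the System.Uri you asked for, provided the decoded text is a well-formed absolute URL. A backslash that is not followed by x and two hex digits is passed through unchanged.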