 

Encoding of the response of Invoke-WebRequest

Tags:

powershell

When using the Invoke-WebRequest cmdlet against a website with non-English characters, I see no way of specifying the encoding of the response / page content.

I use a simple GET on http://colours.cz/ucinkujici/ and the names of the artists come back corrupted. You can reproduce it with this single line:

Invoke-WebRequest http://colours.cz/ucinkujici

Is this a consequence of the cmdlet's design? Can I specify the encoding somewhere? Is there a workaround to get a properly decoded response?

asked Jul 17 '13 17:07 by jumbo


1 Answer

It seems to me you are correct :/

Here is one way to get the content right: save the response to a file first, then read the file into a variable with the correct encoding. Note, however, that you are then working with a plain string, not an HtmlWebResponseObject:

# Write the raw response to disk, then read it back explicitly as UTF-8
Invoke-WebRequest http://colours.cz/ucinkujici -OutFile .\colours.cz.txt
$content = Get-Content .\colours.cz.txt -Encoding UTF8 -Raw
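If the temporary file is unwanted, the same decoding can probably be done in memory. This is only a sketch, assuming the response object exposes its undecoded body as a MemoryStream in RawContentStream and that the page really is UTF-8. ConvertFrom-RawResponse is a hypothetical helper name, and the stand-in object below merely mimics a response so that the example runs without a network call:

```powershell
# Hypothetical helper: decode a response's raw bytes with an explicit encoding.
function ConvertFrom-RawResponse {
    param(
        $Response,                                         # anything with a RawContentStream
        [Text.Encoding]$Encoding = [Text.Encoding]::UTF8   # assumed page encoding
    )
    $Encoding.GetString($Response.RawContentStream.ToArray())
}

# Stand-in for the object returned by Invoke-WebRequest (no network needed).
$sample = "k$([char]0x016F)$([char]0x0148)"   # Czech sample word built from code points
$bytes  = [Text.Encoding]::UTF8.GetBytes($sample)
$stream = New-Object IO.MemoryStream -ArgumentList (,$bytes)
$fake   = New-Object psobject -Property @{ RawContentStream = $stream }

$content = ConvertFrom-RawResponse -Response $fake
```

With a real request, the call would look like ConvertFrom-RawResponse (Invoke-WebRequest http://colours.cz/ucinkujici); the result is again a plain string.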

The following gets you equally far, using the .NET classes directly:

[net.httpwebrequest]$httpwebrequest = [net.webrequest]::create('http://colours.cz/ucinkujici/')
[net.httpWebResponse]$httpwebresponse = $httpwebrequest.getResponse()
# StreamReader decodes as UTF-8 unless another encoding is passed to its constructor
$reader = new-object IO.StreamReader($httpwebresponse.getResponseStream())
$content = $reader.ReadToEnd()
$reader.Close()
$httpwebresponse.Close()
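The snippet above gets the text right because StreamReader decodes as UTF-8 when no encoding is given. For a page served in another encoding, the encoding can be passed to the reader explicitly. A minimal sketch under assumptions: Read-StreamText is a hypothetical helper, UTF-16 is chosen only as an example of a non-UTF-8 encoding, and a MemoryStream stands in for the response stream so the example runs offline:

```powershell
# Hypothetical helper: read a whole stream as text using a chosen encoding.
function Read-StreamText {
    param(
        [IO.Stream]$Stream,
        [Text.Encoding]$Encoding = [Text.Encoding]::UTF8
    )
    $reader = New-Object IO.StreamReader -ArgumentList $Stream, $Encoding
    try     { $reader.ReadToEnd() }
    finally { $reader.Close() }    # closing the reader also closes the stream
}

# Stand-in for $httpwebresponse.GetResponseStream(): UTF-16 bytes held in memory.
$sample = "k$([char]0x016F)$([char]0x0148)"   # Czech sample word built from code points
$bytes  = [Text.Encoding]::Unicode.GetBytes($sample)
$stream = New-Object IO.MemoryStream -ArgumentList (,$bytes)

$text = Read-StreamText -Stream $stream -Encoding ([Text.Encoding]::Unicode)
```

Against a live response, the first argument would be $httpwebresponse.getResponseStream() and the second whichever encoding the server actually uses.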

If you really do want an HtmlWebResponseObject, here is a way to make values taken from it (e.g. from ParsedHtml) more or less readable after calling Invoke-WebRequest ($bad vs. $better):

Invoke-WebRequest http://colours.cz/ucinkujici -outvariable htmlwebresponse
$bad = $htmlwebresponse.parsedhtml.title
$better = [text.encoding]::utf8.getstring([text.encoding]::default.GetBytes($bad))
$bad = $htmlwebresponse.links[7].outerhtml
$better = [text.encoding]::utf8.getstring([text.encoding]::default.GetBytes($bad))
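Why this repair works can be shown without a network call. The sketch below simulates the damage (UTF-8 bytes decoded with a single-byte encoding) and then reverses it. It rests on assumptions: ISO-8859-1 stands in for the legacy default code page, and the sample string is made up for illustration. Note that on PowerShell 6 and later [text.encoding]::default is UTF-8, so the lines above presumably only have an effect on Windows PowerShell.

```powershell
# Hypothetical sample text, built from code points so that the script's own
# file encoding cannot interfere with the demonstration.
$original = "$([char]0x017D)lut$([char]0x00FD) k$([char]0x016F)$([char]0x0148)"

$utf8   = [Text.Encoding]::UTF8
$latin1 = [Text.Encoding]::GetEncoding('iso-8859-1')   # stand-in for the legacy default

# Simulate the bug: UTF-8 bytes interpreted as single-byte characters.
$bad = $latin1.GetString($utf8.GetBytes($original))

# Reverse it: recover the original bytes, then decode them as UTF-8.
$better = $utf8.GetString($latin1.GetBytes($bad))

$roundTripOk = $better -ceq $original
```

The repair is only lossless when every byte survives the wrong decoding, which holds for ISO-8859-1 but is not guaranteed for every code page.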

Update: Here is a new take on this, now that I know you want to work with ParsedHtml.
Once you have your content (see the first two-line snippet, which saves the response to a file and then reads the file back with the correct encoding), you can do this:

$ParsedHtml = New-Object -com "HTMLFILE"
$ParsedHtml.IHTMLDocument2_write($content)
$ParsedHtml.Close()

Et voilà :] E.g. $ParsedHtml.title now displays correctly, and I would guess the rest will be OK as well…

answered Oct 22 '22 17:10 by mousio