I am trying to retrieve some information from a website: I want to look for a specific tag/class and then return the contained text value (innerHTML). This is what I have so far:
$request = Invoke-WebRequest -Uri $url -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"
$src = $request.RawContent
$HTML.write($src)
foreach ($obj in $HTML.all) {
    $obj.getElementsByClassName('some-class-name')
}
I think there is a problem with converting the HTML into the HTML object, since I see a lot of undefined properties and empty results when I try to Select-Object them.
So, after spending two days: how am I supposed to parse HTML with PowerShell?
What does not work:
- IHTMLDocument2 methods, since I don't have Office installed (Unable to use IHTMLDocument2)
- Invoke-WebRequest without -UseBasicParsing, since PowerShell hangs and spawns additional windows while accessing the ParsedHtml property (parsedhtml doesnt respond anymore and Using Invoke-Webrequest in PowerShell 3.0 spawns a Windows Security Warning)

So, since parsing HTML with regex is such a big no-no, how do I do it otherwise? Nothing seems to work.
PowerShell includes some great capabilities for working with two common forms of structured data: HTML and XML. With a scripting language like PowerShell, a little ingenuity, and some trial and error, it is possible to build a reliable web-scraping tool that pulls information from many different web pages.
If you just want to parse HTML and your HTML is intended for the body of your document, you could do the following:

var div = document.createElement("DIV");
div.innerHTML = markup;
result = div.childNodes;

This gives you a collection of child nodes and should work not just in IE8 but even in IE6-7.
For example, a command can use the pipeline operator (|) to send process objects to the ConvertTo-Html cmdlet, the Property parameter to select three properties of the process objects to include in the table, and the Title parameter to specify a title for the HTML page.
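A minimal sketch of such a command (the three property names and the output file are illustrative):

# Render three properties of each running process as an HTML table
Get-Process |
    ConvertTo-Html -Property Name, Path, Company -Title "Process Information" |
    Out-File ProcessInfo.html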
Since no one else has posted an answer, I managed to get a working solution with the following code:
$request = Invoke-WebRequest -Uri $URL -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"

# Use Content (the page body) rather than RawContent, which also
# includes the HTTP headers, and pass the string by [ref] so the
# COM write() call accepts it
[string]$htmlBody = $request.Content
$HTML.write([ref]$htmlBody)

# Collect every element carrying the target class
$filter = $HTML.getElementsByClassName($htmlClassName)
With some URLs I found that $filter was empty, while for others it was populated. All in all, this might work for your situation, but it seems PowerShell isn't the best fit for more complex parsing.
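To actually pull the contained text out of the matched elements, here is a short follow-up sketch (assuming $filter was populated as above):

# Enumerate the matched elements and emit their innerHTML
foreach ($element in $filter) {
    $element.innerHTML
}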
In 2020, with Windows PowerShell 5.x, you can do it like this (note that the ParsedHtml property is not available in PowerShell Core/7):
$searchClass = "banana" <# in this example we parse all elements of class "banana", but you can use any class name you wish #>
$myURI = "url.com" <# replace url.com with any website you want to scrape from #>
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12 <# using TLS 1.2 is vitally important #>

$req = Invoke-WebRequest -Uri $myURI
$req.ParsedHtml.getElementsByClassName($searchClass) | ForEach-Object { Write-Host $_.innerHTML }

# for extra credit we can parse all the links
$req.ParsedHtml.getElementsByTagName('a') | ForEach-Object { Write-Host $_.href } # outputs all the links
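If you want the results in the pipeline rather than printed to the host, a small variation (same $req and $searchClass as above) collects them into a variable:

# Capture each element's innerHTML as a string instead of using Write-Host
$results = $req.ParsedHtml.getElementsByClassName($searchClass) |
    ForEach-Object { $_.innerHTML }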