How to parse the HTML of a website with PowerShell

I am trying to retrieve some information from a website. I want to look for a specific tag/class and then return the text it contains (innerHTML). This is what I have so far:

$request = Invoke-WebRequest -Uri $url -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"
$src = $request.RawContent
$HTML.write($src)


foreach ($obj in $HTML.all) { 
    $obj.getElementsByClassName('some-class-name') 
}

I think there is a problem with converting the HTML into the HTML object, since I see a lot of undefined properties and empty results when I'm trying to "Select-Object" them.

So, after spending two days on this: how am I supposed to parse HTML with PowerShell?

I can't use the IHTMLDocument2 methods, since I don't have Office installed (see "Unable to use IHTMLDocument2").
I can't use Invoke-WebRequest without -UseBasicParsing, since PowerShell hangs and spawns additional windows when accessing the ParsedHtml property (see "parsedhtml doesnt respond anymore" and "Using Invoke-Webrequest in PowerShell 3.0 spawns a Windows Security Warning").

So since parsing HTML with regex is such a big no-no, how do I do it otherwise? Nothing seems to work.

asked Jun 28 '19 14:06

Jan

2 Answers

Since no one else has posted an answer: I managed to get a working solution with the following code:

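# $URL and $htmlClassName must be set beforehand
$request = Invoke-WebRequest -Uri $URL -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"
# Use Content (the body only); RawContent also contains the status line and response headers
[string]$htmlBody = $request.Content
$HTML.write([ref]$htmlBody)
$filter = $HTML.getElementsByClassName($htmlClassName)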

With some URLs the $filter variable came back empty, while for other URLs it was populated. All in all, this might work for your situation, but PowerShell doesn't seem to be the way to go for more complex parsing.
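
The steps in this answer could be wrapped in a small helper, roughly as sketched below. This is not a definitive implementation: the function name Get-HtmlElementText and its parameters are made up for illustration, it assumes Windows (where the "HTMLFile" COM object is available), and the try/catch around IHTMLDocument2_write is an assumption meant to cover machines where that method is exposed instead of write.

```powershell
# Hypothetical helper; the name and parameters are illustrative only.
# Assumes Windows, where the "HTMLFile" COM object can be created.
function Get-HtmlElementText {
    param(
        [Parameter(Mandatory)] [string] $Html,       # HTML source, e.g. (Invoke-WebRequest ...).Content
        [Parameter(Mandatory)] [string] $ClassName   # class name to look up
    )

    $doc = New-Object -ComObject 'HTMLFile'
    try {
        # Exposed on some machines (typically when Office is installed).
        $doc.IHTMLDocument2_write($Html)
    }
    catch {
        # Fallback: the call used in the answer above.
        $doc.write([ref]$Html)
    }

    # Emit the text of every matching element as plain strings.
    $doc.getElementsByClassName($ClassName) | ForEach-Object { $_.innerText }
}
```

It could then be called as Get-HtmlElementText -Html $request.Content -ClassName $htmlClassName. Returning strings rather than COM objects also avoids the mostly empty Select-Object output described in the question.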

answered Oct 19 '22 16:10

Jan

In 2020, with Windows PowerShell 5.x, you do it like this:

$searchClass = "banana" <# in this example we parse all elements of class "banana" but you can use any class name you wish #>
$myURI = "url.com" <# replace url.com with any website you want to scrape from #>

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12 <# using TLS 1.2 is vitally important #>
$req = Invoke-Webrequest -URI $myURI
$req.ParsedHtml.getElementsByClassName($searchClass) | %{Write-Host $_.innerhtml}

#for extra credit we can parse all the links
$req.ParsedHtml.getElementsByTagName('a') | %{Write-Host $_.href} #outputs all the links
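
The ParsedHtml property is only present on the response object that Windows PowerShell returns when -UseBasicParsing is not used; it is absent in PowerShell 6 and later. A script can check for the property before relying on it. A rough sketch follows, in which the function name Test-ParsedHtmlAvailable is hypothetical:

```powershell
# Hypothetical helper: reports whether a response object exposes a ParsedHtml property.
# The lookup goes through PSObject.Properties, which does not read the property's value.
function Test-ParsedHtmlAvailable {
    param([Parameter(Mandatory)] $Response)
    $null -ne $Response.PSObject.Properties['ParsedHtml']
}

# Possible usage with the $req and $searchClass variables from the snippet above:
# if (Test-ParsedHtmlAvailable $req) {
#     $req.ParsedHtml.getElementsByClassName($searchClass) | ForEach-Object { $_.innerHTML }
# } else {
#     Write-Warning 'ParsedHtml is not available on this response object.'
# }
```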

answered Oct 19 '22 15:10

Ben R

Tags: html, dom, powershell, html-parsing