Using querySelectorAll on an mshtml.HTMLDocumentClass object in PowerShell causes a crash

I'm trying to do some web-scraping via PowerShell, as I've recently discovered it is possible to do so without too much trouble.

A good starting point is to just fetch the HTML, use Get-Member, and see what I can do from there, like so:

$html = Invoke-WebRequest "https://www.google.com"
$html.ParsedHtml | Get-Member

The methods available to me for fetching specific elements appear to be the following:

getElementById()
getElementsByName()
getElementsByTagName()

For example I can get the first IMG tag in the document like so:

$html.ParsedHtml.getElementsByTagName("img")[0]

However, after doing some more research into whether I could use CSS selectors or XPath, I discovered that there are unlisted methods available, since ParsedHtml is just the standard HTML Document object:

querySelector()
querySelectorAll()

So instead of doing:

$html.ParsedHtml.getElementsByTagName("img")[0]

I can do:

$html.ParsedHtml.querySelector("img")

So I was expecting to be able to do:

$html.ParsedHtml.querySelectorAll("img")

...in order to get all of the IMG elements. All the documentation I've found and googling I've done supports this. However, in all my testing this function crashes the calling process and reports a heap corruption exception code in the Event Log (0xc0000374).

I'm using PowerShell 5 on Windows 10 x64. I've tried it in a Win10 x64 VM that is a clean, fully patched build. I've also tried it on Win7 x64 upgraded to PowerShell 5. I haven't tried it on anything prior to PowerShell 5, as all our systems here are upgraded, but I probably will once I have time to spin up a new vanilla VM for testing.

Has anyone run into this issue before? All my research so far has hit a dead end. Are there alternatives to querySelectorAll? I need to scrape pages that will have predictable sets of tags inside unpredictable layouts, potentially with no IDs or classes assigned to the tags, so I want to be able to use selectors that allow structure/nesting/wildcards.

P.S. I've also tried using the InternetExplorer.Application COM object in PowerShell. The result is the same, except that Internet Explorer crashes instead of PowerShell. This was actually my original approach; here's the code:

# create browser object
$ie = New-Object -ComObject InternetExplorer.Application

# make browser visible for debugging, otherwise this isn't necessary for function
$ie.Visible = $true

# browse to page
$ie.Navigate("https://www.google.com")
# wait until the browser is no longer busy
do { Start-Sleep -Milliseconds 100 } until (-not $ie.Busy)

# this works
$ie.document.getElementsByTagName("img")[0]

# this works as well
$ie.document.querySelector("img")

# blow it up
$ie.document.querySelectorAll("img")

# we wanna quit the process, but since we blew it up we don't really make it here
$ie.Quit()

Hope I'm not breaking any rules and this post makes sense and is relevant, thanks.

UPDATE

I tested earlier PowerShell versions. v2 through v4 crash using the InternetExplorer.Application COM method. v3 and v4 crash using the Invoke-WebRequest method; v2 doesn't support Invoke-WebRequest.

602

asked May 12 '16 20:05

TheKojukinator

1 Answer

I ran into this problem, too, and posted about it on Reddit. I believe the problem happens when PowerShell tries to enumerate the HTML DOM NodeList object returned by querySelectorAll(). The same kind of object is returned by the childNodes property, which PowerShell can enumerate, so I'm guessing there's some glue code written for .ParsedHtml.childNodes but not for .ParsedHtml.querySelectorAll(). The crash can also be triggered by IntelliSense trying to get tab-completion help for the object.

I found a way around it, though! Just access the native DOM members .item() and .length directly and emit the node objects into a PowerShell array. The following code pulls the newest page of posts from /r/PowerShell, gets the post list anchors via querySelectorAll(), then manually enumerates them using the native DOM members into a PowerShell-native array.

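# Fetch the newest posts page
$Result = Invoke-WebRequest -Uri "https://www.reddit.com/r/PowerShell/new/"

# Do not enumerate $NodeList directly (no pipeline, no foreach); that triggers the crash
$NodeList = $Result.ParsedHtml.querySelectorAll("#siteTable div div p.title a")

# Copy each node into a PowerShell array using the native .length and .item()
$PsNodeList = @()
for ($i = 0; $i -lt $NodeList.Length; $i++) {
    $PsNodeList += $NodeList.item($i)
}

# The copied nodes can be enumerated safely
$PsNodeList | ForEach-Object {
    $_.InnerHtml
}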

Edit: .Length seems to work capitalized or lower-case. I would have expected the DOM to be case-sensitive, so either something is going on to help translate or I'm misunderstanding something. Also, the CSS selector is grabbing the source links (self.PowerShell mostly), but that is my CSS selector logic error, not a problem with querySelectorAll(). Note that the results of querySelectorAll() are not live, so modifying them won't modify the original DOM. I haven't tried modifying them or using their methods yet, but clearly we can grab .InnerHtml at the very least.
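On the capitalization point: PowerShell itself looks up member names case-insensitively, so the translation most likely happens on the PowerShell side rather than in the DOM. I believe the same holds when the call goes through the COM adapter, but treat that as an assumption. A small sketch with a stand-in object ($FakeNodeList is hypothetical, not a real NodeList):

```powershell
# Hypothetical stand-in: mimics a NodeList that exposes a lower-case 'length'.
$FakeNodeList = [pscustomobject]@{ length = 3 }

# PowerShell resolves member names case-insensitively, so both spellings
# read the same property.
$FakeNodeList.length   # 3
$FakeNodeList.Length   # 3
```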

Edit 2: Here is a more-generalized wrapper function:

function Get-FixedQuerySelectorAll {
    param (
        $HtmlWro,
        $CssSelector
    )
    # After assignment, $NodeList will crash powershell if enumerated in any way including Intellisense-completion while coding!
    $NodeList = $HtmlWro.ParsedHtml.querySelectorAll($CssSelector)

    for ($i = 0; $i -lt $NodeList.length; $i++) {
        Write-Output $NodeList.item($i)
    }
}

$HtmlWro is an HTML Web Response Object, the output of Invoke-WebRequest. I originally tried to pass .ParsedHtml instead, but then it would crash on assignment. Doing it this way emits the nodes to the pipeline, which PowerShell collects into an array when more than one node is returned.
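One caveat when consuming the output: if the selector matches exactly one node, the caller gets a single object rather than an array, so wrapping the call in @() may be worth it. Below is a minimal sketch of that, using a stand-in for the web response so it runs without a network or the IE engine. New-FakeResponse and its Data and NodeList properties are hypothetical helpers for illustration, and the fake querySelectorAll() ignores its selector.

```powershell
# Copy of the wrapper above, so this sketch runs on its own.
function Get-FixedQuerySelectorAll {
    param ($HtmlWro, $CssSelector)
    $NodeList = $HtmlWro.ParsedHtml.querySelectorAll($CssSelector)
    for ($i = 0; $i -lt $NodeList.length; $i++) {
        Write-Output $NodeList.item($i)
    }
}

# Hypothetical stand-in for the Invoke-WebRequest result. Its querySelectorAll()
# returns an object exposing only 'length' and 'item()', and ignores the selector.
function New-FakeResponse {
    param ([string[]] $Items)
    $nodeList = [pscustomobject]@{ length = $Items.Count; Data = $Items }
    $nodeList | Add-Member -MemberType ScriptMethod -Name item -Value {
        param($Index)
        $this.Data[$Index]
    }
    $doc = [pscustomobject]@{ NodeList = $nodeList }
    $doc | Add-Member -MemberType ScriptMethod -Name querySelectorAll -Value {
        param($Selector)
        $this.NodeList
    }
    [pscustomobject]@{ ParsedHtml = $doc }
}

# @() guarantees an array regardless of how many nodes come back.
$One  = @(Get-FixedQuerySelectorAll -HtmlWro (New-FakeResponse 'a') -CssSelector 'img')
$Many = @(Get-FixedQuerySelectorAll -HtmlWro (New-FakeResponse 'a', 'b', 'c') -CssSelector 'img')

$One.Count    # 1
$Many.Count   # 3
```

With a real response, the call would look the same: @(Get-FixedQuerySelectorAll -HtmlWro $Result -CssSelector "img").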

166

answered Sep 28 '22 01:09

midnightfreddie

Tags: powershell, com, selectors-api, powershell-5.0, mshtml
