Use getElementById on HTMLElement instead of HTMLDocument

I've been playing around with scraping data from web pages using VBS/VBA.

If it were JavaScript I'd be away, as it's easy, but it doesn't seem to be quite as straightforward in VBS/VBA.

This is an example I made for an answer. It works, but I had planned on accessing the child nodes using getElementsByTagName and could not figure out how to use it! The HTMLElement object does not have those methods.

Sub Scrape()
Dim Browser As InternetExplorer
Dim Document As HTMLDocument
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement

Set Browser = New InternetExplorer

Browser.navigate "http://www.hsbc.com/about-hsbc/leadership"

Do While Browser.Busy Or Browser.readyState <> READYSTATE_COMPLETE
    DoEvents
Loop

Set Document = Browser.Document

Set Elements = Document.getElementsByClassName("profile-col1")

For Each Element in Elements
    Debug.Print "[  name] " & Trim(Element.Children(1).Children(0).innerText)
    Debug.Print "[ title] " & Trim(Element.Children(1).Children(1).innerText)
Next Element

Set Document = Nothing
Set Browser = Nothing
End Sub

I have been looking at the HTMLElement.document property to see whether it is like a fragment of the document, but it's either difficult to work with or just isn't what I think it is.

Dim Fragment As HTMLDocument
Set Element = Document.getElementById("example") ' This works
Set Fragment = Element.document ' This doesn't

This also seems a long-winded way to do it (although that's usually the way with VBA, IMO). Does anyone know if there is a simpler way to chain functions?

Document.getElementById("target").getElementsByTagName("tr") would be awesome...

Asked by NickSlash on Mar 03 '13 at 22:03


1 Answer

I would use an XMLHTTP request to retrieve the page content, as it is much faster than automating Internet Explorer. It is then easy enough to use querySelectorAll with a CSS class selector to grab the elements by class name, and to access their child elements by tag name and index.

Option Explicit
Public Sub GetInfo()
    Dim sResponse As String, html As HTMLDocument, elements As Object, i As Long

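    ' Fetch the page synchronously; the If-Modified-Since header is there to avoid a cached response
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.hsbc.com/about-hsbc/leadership", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        ' Convert the response byte array into a VBA string
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    ' Load the markup into an in-memory HTMLDocument so it can be queried without a browser
    Set html = New HTMLDocument
    With html
        .body.innerHTML = sResponse
        ' elements is declared As Object, so the calls on each item below are late-bound
        Set elements = .querySelectorAll(".profile-col1")
        For i = 0 To elements.Length - 1
            Debug.Print String(20, Chr$(61))   ' separator line of 20 "=" characters
            Debug.Print elements.item(i).getElementsByTagName("a")(0).innerText
            Debug.Print elements.item(i).getElementsByTagName("p")(0).innerText
            Debug.Print elements.item(i).getElementsByTagName("p")(1).innerText
        Next
    End With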
End Sub

References:

VBE > Tools > References > Microsoft HTML Object Library
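
On the chaining asked about in the question: as far as I can tell, the early-bound IHTMLElement interface returned by getElementById does not expose getElementsByTagName, which is why the one-liner does not compile against a typed HTMLDocument. A minimal sketch of one workaround, assuming the same Microsoft HTML Object Library reference, is to hold the element (or the document) in an Object variable so the call is resolved at run time. The id "target" and the table markup here are made up purely for illustration.

```vba
Option Explicit

' Sketch only: the "target" id and the HTML string below are hypothetical.
Public Sub ChainFromElement()
    Dim html As HTMLDocument
    Dim doc As Object       ' late-bound view of the same document
    Dim target As Object    ' late-bound, so members beyond IHTMLElement resolve at run time
    Dim rows As Object
    Dim i As Long

    Set html = New HTMLDocument
    html.body.innerHTML = "<table id=""target""><tr><td>a</td></tr><tr><td>b</td></tr></table>"

    ' Two steps: fetch the element by id, then search inside it by tag name
    Set target = html.getElementById("target")
    Set rows = target.getElementsByTagName("tr")

    For i = 0 To rows.Length - 1
        Debug.Print rows.Item(i).innerText
    Next i

    ' One step: chaining works when the document itself is late-bound
    Set doc = html
    Debug.Print doc.getElementById("target").getElementsByTagName("tr").Length
End Sub
```

The trade-off is that late binding gives up IntelliSense and compile-time checking. Declaring the element variable As IHTMLElement2 should also expose getElementsByTagName while keeping early binding, though I have not verified that against every version of the library.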

Answered by QHarr on Oct 07 '22 at 19:10