Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find/extract an HTML "font" element with attribute size="+1" using Excel VBA

Tags:

html

excel

vba

I want to extract a U.S. Patent title from a url like

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874

(Update: as pointed out the comments, the patent title is not labeled "Title;" however, it consistently appears by itself above "Abstract" on the web page.) In most cases it is in the 7th child element of "body" or the 3rd "font" element in the document, but occasionally a notice at the top of the page to "** Please see images for: ( Certificate of Correction ) **" or "( Reexamination Certificate )" messes up both methods of extraction by inserting one additional child of "body" and three additional "font" elements before you get to the title.

However, the title seems to be consistently the first "font" element with the attribute "size" having a value of "+1". Unfortunately other elements have size="-1", including the aforementioned elements that are not always present, so it has to be specifically with that attribute and value. I have searched but can't figure out how to get elements by attribute and value. Here is my code:

Function Test_UpdateTitle(url As String)
    Dim title As String
    Dim pageSource As String
    Dim xml_obj As XMLHTTP60
    Set xml_obj = CreateObject("MSXML2.XMLHTTP")
    xml_obj.Open "GET", url, False
    xml_obj.send
    pageSource = xml_obj.responseText
    Set xml_obj = Nothing

    Dim html_doc As HTMLDocument
    Set html_doc = CreateObject("HTMLFile")
    html_doc.body.innerHTML = pageSource
    Dim fontElement As IHTMLElement

'Methods 1 and 2 fail in cases of a certificate of correction or reexamination certificate

'Method 1

'    Dim body As IHTMLElement
'    Set body = html_doc.getElementsByTagName("body").Item(0)
'    Set fontElement = body.Children(6)

'Method 2
'    Set fontElement = html_doc.getElementsByTagName("font").Item(3)

'Method 3

    Dim n As Integer
    For n = 3 To html_doc.getElementsByTagName("font").Length - 1
        Set fontElement = html_doc.getElementsByTagName("font").Item(n)
        If InStr(fontElement.innerText, "Please see") = 0 And _
        InStr(fontElement.innerText, "( Certificate of Correction )") = 0 And _
        InStr(fontElement.innerText, "( Reexamination Certificate )") = 0 And _
        InStr(fontElement.innerText, " **") = 0 Then
            Test_UpdateTitle = fontElement.innerText
            Exit Function
        End If
    Next n

End Function

I should add that the " **" is not working to skip the the last element <b> **</b> and I am getting " **" as the title where there is a notice to please see images. Is asterisk a wildcard character in this context?

like image 517
PatentWookiee Avatar asked Oct 19 '22 23:10

PatentWookiee


1 Answers

You can try this. As long as its the first font tag with the size attribute and a value of "+1" this should work. I only tested with 3 different pages but they all returned the correct results.

Function Test_UpdateTitle(url)
    title = "Title Not Found!"
    Set xml_obj = CreateObject("MSXML2.XMLHTTP")
    xml_obj.Open "GET", url, False
    xml_obj.send
    pageSource = xml_obj.responseText
    Set xml_obj = Nothing

    Set document = CreateObject("HTMLFile")
    document.write pageSource   

    For i = 0 To document.getElementsByTagName("font").length - 1
        If document.getElementsByTagName("font")(i).size = "+1" Then
            title = document.getElementsByTagName("font")(i).innerText
            Exit For
        End If
    Next

    Test_UpdateTitle = title

End Function

MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874")
MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=fight.TI.&OS=TTL/fight&RS=TTL/fight")
MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=14&f=G&l=50&d=PTXT&p=1&S1=search&OS=search&RS=search")
like image 102
Sean Wessell Avatar answered Nov 01 '22 09:11

Sean Wessell