I want to extract a U.S. Patent title from a url like
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874
(Update: as pointed out the comments, the patent title is not labeled "Title;" however, it consistently appears by itself above "Abstract" on the web page.) In most cases it is in the 7th child element of "body" or the 3rd "font" element in the document, but occasionally a notice at the top of the page to "** Please see images for: ( Certificate of Correction ) **" or "( Reexamination Certificate )" messes up both methods of extraction by inserting one additional child of "body" and three additional "font" elements before you get to the title.
However, the title seems to be consistently the first "font" element with the attribute "size" having a value of "+1". Unfortunately other elements have size="-1", including the aforementioned elements that are not always present, so it has to be specifically with that attribute and value. I have searched but can't figure out how to get elements by attribute and value. Here is my code:
Function Test_UpdateTitle(url As String)
Dim title As String
Dim pageSource As String
Dim xml_obj As XMLHTTP60
Set xml_obj = CreateObject("MSXML2.XMLHTTP")
xml_obj.Open "GET", url, False
xml_obj.send
pageSource = xml_obj.responseText
Set xml_obj = Nothing
Dim html_doc As HTMLDocument
Set html_doc = CreateObject("HTMLFile")
html_doc.body.innerHTML = pageSource
Dim fontElement As IHTMLElement
'Methods 1 and 2 fail in cases of a certificate of correction or reexamination certificate
'Method 1
' Dim body As IHTMLElement
' Set body = html_doc.getElementsByTagName("body").Item(0)
' Set fontElement = body.Children(6)
'Method 2
' Set fontElement = html_doc.getElementsByTagName("font").Item(3)
'Method 3
Dim n As Integer
For n = 3 To html_doc.getElementsByTagName("font").Length - 1
Set fontElement = html_doc.getElementsByTagName("font").Item(n)
If InStr(fontElement.innerText, "Please see") = 0 And _
InStr(fontElement.innerText, "( Certificate of Correction )") = 0 And _
InStr(fontElement.innerText, "( Reexamination Certificate )") = 0 And _
InStr(fontElement.innerText, " **") = 0 Then
Test_UpdateTitle = fontElement.innerText
Exit Function
End If
Next n
End Function
I should add that the " **" is not working to skip the the last element <b> **</b>
and I am getting " **" as the title where there is a notice to please see images. Is asterisk a wildcard character in this context?
You can try this. As long as its the first font tag with the size attribute and a value of "+1" this should work. I only tested with 3 different pages but they all returned the correct results.
Function Test_UpdateTitle(url)
title = "Title Not Found!"
Set xml_obj = CreateObject("MSXML2.XMLHTTP")
xml_obj.Open "GET", url, False
xml_obj.send
pageSource = xml_obj.responseText
Set xml_obj = Nothing
Set document = CreateObject("HTMLFile")
document.write pageSource
For i = 0 To document.getElementsByTagName("font").length - 1
If document.getElementsByTagName("font")(i).size = "+1" Then
title = document.getElementsByTagName("font")(i).innerText
Exit For
End If
Next
Test_UpdateTitle = title
End Function
MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6293874.PN.&OS=PN/6293874&RS=PN/6293874")
MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=fight.TI.&OS=TTL/fight&RS=TTL/fight")
MsgBox Test_UpdateTitle("http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=14&f=G&l=50&d=PTXT&p=1&S1=search&OS=search&RS=search")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With