Using the Selenium package, I am trying to open a URL in a browser, which can be Firefox or Google Chrome. The given URL redirects to another URL, and the browser has to wait until its URL changes. Here's the code I'm using:
import re
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
starttime = time.time()
# The original URL; the wait below compares the browser's URL against it.
patent = "http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=3,930,293"
# Note: Firefox is normally driven by geckodriver, not chromedriver.
browser = webdriver.Firefox(executable_path='\\somepath\\chromedriver.exe')
browser.get(patent)
wait = WebDriverWait(browser, 5)
# Block until the browser has been redirected away from the original URL.
wait.until(lambda driver: driver.current_url != patent)
url = browser.current_url
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for tag in soup.find_all(text=re.compile('Current U.S. Class:')):
    table = tag.findParent('table')
    result = table.find('tr').text
browser.close()
print(result)  # Current U.S. Class: 29/428
print(time.time() - starttime)
But this takes too long (about 18 to 20 seconds), and I have a huge dataset of these URLs to work through. Is there a faster way to do this?
Looking at the response of the original URL, it only contains an HTML redirect to the new URL:
<HTML>
<HEAD>
<TITLE>Single Document</TITLE>
<META HTTP-EQUIV="REFRESH" CONTENT="1;URL=/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=3,930,293.PN.&OS=PN/3,930,293&RS=PN/3,930,293">
</HEAD>
</HTML>
Assuming the response always has the same format and content, you can capture the redirect URL from it with a regular expression, like so:
re.search('CONTENT="1;URL=(.+)"', r.text).group(1)
Then request that URL. All of this can be done with requests alone, so you won't need to wait for Selenium!
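If the response varies slightly (a different delay value or letter case, say), a more tolerant version of the same extraction might look like the sketch below. The helper name extract_refresh_url and the exact pattern are illustrative assumptions; the sketch also assumes the target is relative to the page URL and resolves it with urljoin from the standard library.

```python
import re
from urllib.parse import urljoin

# Matches the CONTENT attribute of a meta refresh tag, e.g.
# CONTENT="1;URL=/some/path". The delay can be any whole number of seconds.
META_REFRESH = re.compile(r'CONTENT="\s*\d+\s*;\s*URL=([^"]+)"', re.IGNORECASE)


def extract_refresh_url(html, page_url):
    """Return the absolute meta refresh target found in html, or None."""
    match = META_REFRESH.search(html)
    if match is None:
        return None
    # urljoin resolves a relative target such as "/netacgi/..." against page_url.
    return urljoin(page_url, match.group(1).strip())
```

Returning None when no redirect is found lets the caller decide whether to parse the first response directly or skip that URL.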
Here's your code after using the trick above:
import time, requests, re
from bs4 import BeautifulSoup
start_time = time.time()
root_url = "http://patft.uspto.gov"
r = requests.get(root_url + "/netacgi/nph-Parser?patentnumber=3,930,293")
r = requests.get(root_url + re.search('CONTENT="1;URL=(.+)"', r.text).group(1))
soup = BeautifulSoup(r.text, 'lxml')
for tag in soup.find_all(string='Current U.S. Class:'):
    table = tag.findParent('table')
    result = table.find('tr').text
print(result)
print(time.time() - start_time)
Output:
Current U.S. Class: 29/428; 28/284; 28/297; 8/155
2.2239434719085693
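Since the question mentions a huge dataset, the per-URL work could also be run on a few threads at once, because most of the time is spent waiting on the network. The following is only a sketch under assumptions: lookup_many and its lookup argument are hypothetical names, and lookup stands for any function that takes one patent number and returns its result (for example, a function wrapping the requests code above). Keeping max_workers small avoids hammering the server.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup_many(patent_numbers, lookup, max_workers=4):
    """Apply lookup to every patent number using a small thread pool.

    lookup is a hypothetical callable: patent number in, result out.
    Returns a dict mapping each patent number to its result, or to the
    exception raised while processing it.
    """
    numbers = list(patent_numbers)  # allow any iterable, including generators

    def safe_lookup(number):
        try:
            return lookup(number)
        except Exception as exc:  # keep one bad URL from stopping the batch
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        results = list(pool.map(safe_lookup, numbers))
    return dict(zip(numbers, results))
```

Storing the exception instead of raising it means a single failed request shows up in the result dict and can be retried later.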