Fastest way to open and deal with a URL in a web browser in Python

Question

Using the Selenium package, I am trying to open a URL in a browser. The browser can be Firefox or Google Chrome. The given URL redirects to another URL, and the browser has to wait until its URL changes. Here's the code I'm using:

import re
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

starttime = time.time()
patent = "http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=3,930,293"
# chromedriver pairs with Chrome; webdriver.Firefox would need geckodriver instead
browser = webdriver.Chrome(executable_path=r'\somepath\chromedriver.exe')
browser.get(patent)
wait = WebDriverWait(browser, 5)
# Block until the redirect has changed the browser's URL
wait.until(lambda driver: driver.current_url != patent)
url = browser.current_url
browser.close()  # the browser is only needed to discover the final URL
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

for tag in soup.find_all(text=re.compile('Current U.S. Class:')):
    table = tag.findParent('table')
    result = table.find('tr').text
    print(result)  # Current U.S. Class: 29/428 
    print(time.time() - starttime)

But this takes too long (around 18 to 20 seconds), and I have a huge dataset of these URLs to work through. Is there a faster way to do this?

Omar Einea · Accepted Answer

The response from the original URL contains only an HTML meta-refresh redirect to the new URL:

<HTML>
<HEAD>
<TITLE>Single Document</TITLE>
<META HTTP-EQUIV="REFRESH" CONTENT="1;URL=/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=3,930,293.PN.&OS=PN/3,930,293&RS=PN/3,930,293">
</HEAD>
</HTML>

Assuming the response always has the same format and content, you can capture the relative URL from it with a regular expression, like so:

re.search('CONTENT="1;URL=(.+)"', r.text).group(1)
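As a self-contained illustration of that extraction, the sketch below runs a similar pattern against the response shown above and resolves the relative path with `urllib.parse.urljoin`. The helper name `extract_refresh_url` is hypothetical, and the stricter `[^"]+` group, the `re.IGNORECASE` flag, and returning `None` when no redirect is found are assumptions about how you might want to handle responses that differ slightly from this one.

```python
import re
from urllib.parse import urljoin

# Variation on the pattern above: [^"]+ stops at the first closing quote,
# \d+ accepts any delay value, and IGNORECASE tolerates lowercase attributes.
# These relaxations are assumptions, not something the site is known to need.
REFRESH_RE = re.compile(r'CONTENT="\d+;URL=([^"]+)"', re.IGNORECASE)


def extract_refresh_url(html, base_url):
    """Return the absolute meta-refresh target found in html, or None."""
    match = REFRESH_RE.search(html)
    if match is None:
        return None
    # The captured value is a root-relative path, so resolve it against the
    # URL that was originally requested.
    return urljoin(base_url, match.group(1))


# The redirect page from the answer, collapsed into one string.
sample = (
    '<HTML><HEAD><TITLE>Single Document</TITLE>'
    '<META HTTP-EQUIV="REFRESH" CONTENT="1;URL=/netacgi/nph-Parser?'
    'Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm'
    '&r=1&f=G&l=50&s1=3,930,293.PN.&OS=PN/3,930,293&RS=PN/3,930,293">'
    '</HEAD></HTML>'
)
base = "http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=3,930,293"

print(extract_refresh_url(sample, base))
```

Checking for `None` before calling `.group(1)` avoids an `AttributeError` on any page that does not contain the redirect tag.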

Then request that URL directly. All of this can now be done with requests, so you won't need to wait for Selenium!

Here's your code after using the trick above:

import re
import time

import requests
from bs4 import BeautifulSoup

start_time = time.time()
root_url = "http://patft.uspto.gov"
# The first request returns only the small meta-refresh page
r = requests.get(root_url + "/netacgi/nph-Parser?patentnumber=3,930,293")
# Pull the redirect target out of that page and request it directly
r = requests.get(root_url + re.search('CONTENT="1;URL=(.+)"', r.text).group(1))

soup = BeautifulSoup(r.text, 'lxml')

for tag in soup.find_all(string='Current U.S. Class:'):
    table = tag.findParent('table')
    result = table.find('tr').text
    print(result)
    print(time.time() - start_time)

Output:

Current U.S. Class: 29/428; 28/284; 28/297; 8/155 
2.2239434719085693
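Since the question mentions a large set of these URLs, the two-step fetch could be wrapped in a function and applied to each URL in turn. The following is only a sketch under stated assumptions: the names `fetch_text`, `fetch_following_refresh`, and `fetch_many` are hypothetical, the downloader uses the standard library's `urllib` as a stand-in for `requests.get(url).text`, and the 10-second timeout is an arbitrary choice.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumed pattern for the meta-refresh tag shown in the answer.
REFRESH_RE = re.compile(r'CONTENT="\d+;URL=([^"]+)"', re.IGNORECASE)


def fetch_text(url):
    """Download url and return its body as text (stdlib stand-in for requests)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_following_refresh(url, fetch=fetch_text):
    """Fetch url; if the body is a meta-refresh page, return the target page."""
    body = fetch(url)
    match = REFRESH_RE.search(body)
    if match is None:
        # No redirect tag found, so this is already the final page.
        return body
    return fetch(urljoin(url, match.group(1)))


def fetch_many(urls, fetch=fetch_text):
    """Return a dict mapping each URL to its final page text, fetched in order."""
    return {url: fetch_following_refresh(url, fetch) for url in urls}
```

The `fetch` parameter lets you swap in a different downloader. For example, passing `lambda u: session.get(u).text` with a `requests.Session` would reuse connections across requests, and the returned text can then be handed to BeautifulSoup as in the code above.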

Tags: python, python-3.x, beautifulsoup, selenium-webdriver, web-scraping

Roshni Amber
