
Fastest way to open and deal with a URL in a web browser in Python

Using the Selenium package, I am trying to open a URL in a browser (Firefox or Google Chrome). The given URL redirects to another URL, and the browser has to wait until its current URL changes. Here's the code I'm using:

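import re
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

starttime = time.time()
patent = "http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=3,930,293"
# The driver binary must match the browser class (chromedriver goes with Chrome).
# executable_path is the Selenium 3-style way of passing the driver path.
browser = webdriver.Chrome(executable_path='\\somepath\\chromedriver.exe')
browser.get(patent)
# Wait up to 5 seconds for the redirect to change the current URL
wait = WebDriverWait(browser, 5)
wait.until(lambda driver: driver.current_url != patent)
url = browser.current_url
browser.close()  # close once here, not inside the loop below
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

for tag in soup.find_all(text=re.compile('Current U.S. Class:')):
    table = tag.findParent('table')
    result = table.find('tr').text
    print(result)  # Current U.S. Class: 29/428 
    print(time.time() - starttime)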

But this takes too long (about 18 to 20 seconds), and I have a huge dataset of these URLs to process. Is there a faster way to do this?

Roshni Amber asked Oct 24 '25 12:10


1 Answer

The response from the original URL contains only an HTML meta refresh redirect to the new URL:

<HTML>
<HEAD>
<TITLE>Single Document</TITLE>
<META HTTP-EQUIV="REFRESH" CONTENT="1;URL=/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=3,930,293.PN.&OS=PN/3,930,293&RS=PN/3,930,293">
</HEAD>
</HTML>

Assuming the response always has this format, you can capture the target URL from it with a regular expression, like so:

re.search('CONTENT="1;URL=(.+)"', r.text).group(1)

Then request that URL. Both steps can now be done with requests alone, so you no longer need to wait for Selenium!
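The extraction step can be tried offline against the redirect page shown above. This is only a sketch: the helper name extract_refresh_target and the slightly stricter pattern (any delay value, stopping at the closing quote) are my own assumptions, not something the site documents.

```python
import re
from urllib.parse import urljoin

# Sample body copied from the redirect page shown above.
SAMPLE_HTML = """<HTML>
<HEAD>
<TITLE>Single Document</TITLE>
<META HTTP-EQUIV="REFRESH" CONTENT="1;URL=/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=3,930,293.PN.&OS=PN/3,930,293&RS=PN/3,930,293">
</HEAD>
</HTML>"""

# Assumed pattern: a numeric delay, then URL=, then everything up to the closing quote.
REFRESH_RE = re.compile(r'CONTENT="\d+;\s*URL=([^"]+)"', re.IGNORECASE)


def extract_refresh_target(html, base_url):
    """Return the absolute meta refresh target, or None if there is none."""
    match = REFRESH_RE.search(html)
    if match is None:
        return None
    # The captured URL is relative, so resolve it against the page it came from.
    return urljoin(base_url, match.group(1))


target = extract_refresh_target(
    SAMPLE_HTML,
    "http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=3,930,293",
)
print(target)
```

Returning None instead of raising lets the caller decide what to do with a page that has no redirect.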


Here's your code after using the trick above:

import time, requests, re
from bs4 import BeautifulSoup
start_time = time.time()
root_url = "http://patft.uspto.gov"
r = requests.get(root_url + "/netacgi/nph-Parser?patentnumber=3,930,293")
r = requests.get(root_url + re.search('CONTENT="1;URL=(.+)"', r.text).group(1))

soup = BeautifulSoup(r.text, 'lxml')

for tag in soup.find_all(string='Current U.S. Class:'):
    table = tag.findParent('table')
    result = table.find('tr').text
    print(result)
    print(time.time() - start_time)

Output:

Current U.S. Class: 29/428; 28/284; 28/297; 8/155 
2.2239434719085693
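Since the question mentions a huge dataset, reusing one connection across requests may save further time. Below is a rough sketch of the two-step fetch as a function. The name fetch_patent_page, the timeout value, and the fallback for pages without a redirect are my assumptions; `session` can be any object that behaves like requests.Session.

```python
import re
from urllib.parse import urljoin

ROOT_URL = "http://patft.uspto.gov"
# Assumed pattern for the meta refresh tag shown earlier in this answer.
REFRESH_RE = re.compile(r'CONTENT="\d+;\s*URL=([^"]+)"', re.IGNORECASE)


def fetch_patent_page(session, patent_number, timeout=10):
    """Fetch the redirect page, follow its meta refresh, return the final HTML.

    `session` needs a get(url, timeout=...) method whose result has
    .text and .raise_for_status(), as requests.Session does.
    """
    first_url = f"{ROOT_URL}/netacgi/nph-Parser?patentnumber={patent_number}"
    first = session.get(first_url, timeout=timeout)
    first.raise_for_status()

    match = REFRESH_RE.search(first.text)
    if match is None:
        # Assumption: no meta refresh means this is already the final page.
        return first.text

    # Resolve the relative redirect target and fetch it on the same session.
    second = session.get(urljoin(first_url, match.group(1)), timeout=timeout)
    second.raise_for_status()
    return second.text
```

With requests installed, you would create one session (`with requests.Session() as session:`), call `fetch_patent_page(session, "3,930,293")` for each patent number, and pass the returned HTML to BeautifulSoup as in the code above. I have not timed this variant, so treat any speedup as something to measure.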
Omar Einea answered Oct 27 '25 00:10


