
Use an already open webpage (with Selenium) in BeautifulSoup?

I have a web page open and logged in using webdriver code. I am using webdriver because the page requires a login and various other actions before I can start scraping.

The aim is to scrape data from this open page. I need to find links and open them, so Selenium WebDriver and BeautifulSoup will be used together a lot.

I looked at the bs4 documentation, but the BeautifulSoup(open("ccc.html")) pattern throws an error when I pass it the page's URL:

soup = bs4.BeautifulSoup(open("https://m/search.mp?ss=Pr+Dn+Ts"))

OSError: [Errno 22] Invalid argument: 'https://m/search.mp?ss=Pr+Dn+Ts'

I assume this is because it's not a .html file?

Sid asked Feb 06 '23 06:02


1 Answer

You are trying to open a page by its web address. open() cannot do that, because it only works on local file paths. Use urlopen() instead:

import bs4
from urllib.request import urlopen  # Python 3
# from urllib2 import urlopen  # Python 2

url = "your target url here"
soup = bs4.BeautifulSoup(urlopen(url), "html.parser")

Or use requests, the "HTTP for Humans" library:

import requests

response = requests.get(url)
soup = bs4.BeautifulSoup(response.content, "html.parser")

Also note that it is strongly advisable to specify a parser explicitly. I've used html.parser in this case, but other parsers are available, such as lxml and html5lib.
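As a rough illustration of choosing a parser explicitly, the sketch below picks lxml when it happens to be installed and falls back to the built-in html.parser otherwise. The helper name pick_parser and the sample HTML string are assumptions made for this example, not part of bs4.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser():
    """Hypothetical helper: prefer lxml if installed, else the built-in parser."""
    return "lxml" if importlib.util.find_spec("lxml") else "html.parser"


# Sample markup, assumed for illustration only.
html = "<html><body><p>Hello</p></body></html>"

# Passing the parser name explicitly avoids bs4 guessing one for you.
soup = BeautifulSoup(html, pick_parser())
print(soup.p.get_text())
```

Naming the parser keeps results consistent across machines, since different parsers can build slightly different trees from malformed HTML.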


I want to use the exact same page (same instance)

A common way to do it is to get the driver.page_source and pass it to BeautifulSoup for further parsing:

from bs4 import BeautifulSoup
from selenium import webdriver

url = "your target url here"

driver = webdriver.Firefox()
driver.get(url)

# wait for the page to load, log in, etc.

source = driver.page_source
driver.quit()  # remove this line to leave the browser open

soup = BeautifulSoup(source, "html.parser")
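Since the question mentions finding links and opening them in the same logged-in session, here is a minimal sketch of how the two libraries might be combined. The function names extract_links and visit_links are hypothetical, and the code assumes driver is an already logged-in Selenium WebDriver that is not quit in between.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(page_source, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def visit_links(driver):
    """Parse the driver's current page, then open each link in the same browser."""
    links = extract_links(driver.page_source, driver.current_url)
    for link in links:
        driver.get(link)  # same browser instance, so the session is reused
        soup = BeautifulSoup(driver.page_source, "html.parser")
        print(link, soup.title.string if soup.title else None)
```

Calling visit_links(driver) after the login steps would parse each opened page with BeautifulSoup while Selenium keeps handling navigation. Relative hrefs are resolved against the current page URL with urljoin.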
alecxe answered Feb 07 '23 19:02