I have a web page open and logged in using webdriver code. I am using webdriver because the page requires a login and various other actions before I can start scraping.
The aim is to scrape data from this open page. I need to find links and open them, so there will be a lot of interplay between Selenium webdriver and BeautifulSoup.
I looked at the bs4 documentation, but its BeautifulSoup(open("ccc.html")) pattern throws an error when I pass it my URL:
soup = bs4.BeautifulSoup(open("https://m/search.mp?ss=Pr+Dn+Ts"))
OSError: [Errno 22] Invalid argument: 'https://m/search.mp?ss=Pr+Dn+Ts'
I assume this is because it's not a .html file?
You are trying to open a page by its web address. The built-in open() only opens local files, so it cannot do that; use urlopen() instead:
from urllib.request import urlopen # Python 3
# from urllib2 import urlopen # Python 2
url = "your target url here"
soup = bs4.BeautifulSoup(urlopen(url), "html.parser")
Or use the requests library ("HTTP for Humans"):
import requests
response = requests.get(url)
soup = bs4.BeautifulSoup(response.content, "html.parser")
Also note that it is strongly advisable to specify a parser explicitly. I've used html.parser here; other parsers, such as lxml and html5lib, are also available.
I want to use the exact same page (same instance)
A common way to do that is to take driver.page_source and pass it to BeautifulSoup for further parsing:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get(url)
# wait for page to load..
source = driver.page_source
driver.quit() # remove this line to leave the browser open
soup = BeautifulSoup(source, "html.parser")
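Since the goal in the question is to find links and then open them, here is a minimal sketch of how the parsed source could feed back into the driver. The helper name extract_links is hypothetical (it is not part of bs4 or Selenium), and the sketch assumes the links of interest are plain <a href="..."> elements; adjust the selector for the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(page_source, base_url):
    """Return absolute URLs for every <a> tag that has an href (hypothetical helper)."""
    soup = BeautifulSoup(page_source, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Relative hrefs are resolved against the URL of the page they came from.
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Possible usage with the already logged-in driver (assumption: the browser
# is left open, i.e. driver.quit() has not been called yet):
#
#   for link in extract_links(driver.page_source, driver.current_url):
#       driver.get(link)  # same browser instance, so the login session is kept
#       soup = BeautifulSoup(driver.page_source, "html.parser")
```

Collecting the URLs as strings before navigating avoids holding references to elements of a page the driver has already left.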