I want to get the source code of a website using Selenium, find a particular element using BeautifulSoup, and then convert it back into Selenium as a selenium.webdriver.remote.webelement object. Like so:
driver.get("www.google.com")
soup = BeautifulSoup(driver.source)
element = soup.find(title="Search")
element = Selenium.webelement(element)
element.click()
How can I achieve this?
The combination of Beautiful Soup and Selenium will do the job of dynamic scraping. Selenium automates web browser interaction from Python, so data rendered by JavaScript can be made available by automating button clicks with Selenium and then extracted with Beautiful Soup.
Developers should keep some drawbacks in mind when using Selenium for their web scraping projects. The most noticeable disadvantage is that it is not as fast as fetching pages with plain HTTP requests and parsing them with Beautiful Soup.
We can parse a website using Selenium and Beautiful Soup in Python. Web scraping is a technique for extracting content from web pages, used extensively in data science and metrics preparation. In Python, it is achieved with the BeautifulSoup package. To have BeautifulSoup along with Selenium, we should run the command:
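pip install beautifulsoup4 selenium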
The get_page() function below fetches a web page by URL, decodes it to UTF-8, and parses it into a BeautifulSoup object using the HTML parser. Once we have a BeautifulSoup object, we can start extracting information from the page. BeautifulSoup provides many find functions to locate elements inside the page and drill down into deeply nested elements.
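A minimal sketch of such a get_page() helper, using urllib from the standard library (an assumption; any HTTP client would do), could look like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_page(url):
    # Fetch the raw bytes, decode them as UTF-8, and parse with the built-in HTML parser.
    html = urlopen(url).read().decode('utf-8')
    return BeautifulSoup(html, 'html.parser')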
driver.page_source returns the full HTML of the rendered page. Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present or absent on the page) or to extract the data and save it for further analysis (web scraping).
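For example, once a driver has loaded a page, the rendered HTML can be handed straight to Beautiful Soup:

soup = BeautifulSoup(driver.page_source, 'html.parser')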
There are many methods available in the Selenium API to select elements on the page. You can locate elements by ID, name, class name, tag name, link text, CSS selector, or XPath. As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need.
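For instance, with Selenium 4's By-based API (the selectors below are purely illustrative):

from selenium.webdriver.common.by import By

driver.find_element(By.ID, "login")                    # by element id
driver.find_element(By.NAME, "q")                      # by name attribute
driver.find_element(By.CSS_SELECTOR, "input.search")   # by CSS selector
driver.find_element(By.XPATH, "//input[@name='q']")    # by XPath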
A general solution that worked for me is to compute the XPath of the bs4 element, then use that XPath to find the element in Selenium:

xpath = xpath_soup(soup_element)
selenium_element = driver.find_element(By.XPATH, xpath)  # find_element_by_xpath(xpath) on Selenium 3
...
import itertools

def xpath_soup(element):
    """
    Generate xpath of soup element
    :param element: bs4 text or node
    :return: xpath as string
    """
    components = []
    # Text nodes have no name, so start from the enclosing tag.
    child = element if element.name else element.parent
    for parent in child.parents:
        # parent is a bs4.element.Tag
        previous = itertools.islice(parent.children, 0, parent.contents.index(child))
        xpath_tag = child.name
        # Count preceding siblings with the same tag to get the 1-based XPath index.
        xpath_index = sum(1 for i in previous if i.name == xpath_tag) + 1
        components.append(xpath_tag if xpath_index == 1 else '%s[%d]' % (xpath_tag, xpath_index))
        child = parent
    components.reverse()
    return '/%s' % '/'.join(components)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://www.google.com")

# Parse the rendered page with Beautiful Soup.
soup = BeautifulSoup(driver.page_source, 'html.parser')
search_soup_element = soup.find(title="Search")

# Bridge back to Selenium via a stable attribute of the bs4 element.
input_element = soup.select('input.gsfi.lst-d-f')[0]
search_box = driver.find_element(by='name', value=input_element.attrs['name'])
search_box.send_keys('Hello World!')
search_box.send_keys(Keys.RETURN)
This pretty much works. There are cases where working with both webdriver and BeautifulSoup pays off, though it isn't strictly necessary for this example.
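Tying the pieces together, the search_soup_element found above could be converted with the xpath_soup() helper and clicked through Selenium (assuming both are in scope and the element was actually found):

from selenium.webdriver.common.by import By

search_xpath = xpath_soup(search_soup_element)       # XPath computed from the bs4 element
driver.find_element(By.XPATH, search_xpath).click()  # re-located as a live WebElement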