I used Selenium to scrape a website that loads more content as you scroll, and wrote the code below:
import requests
from bs4 import BeautifulSoup
import csv
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import unittest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import re

output_file = open("Kijubi.csv", "w", newline='')

class Crawling(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.set_window_size(1024, 768)
        self.base_url = "http://www.viatorcom.de/"
        self.accept_next_alert = True

    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get(self.base_url + "de/7132/Seoul/d973-allthingstodo")
        for i in range(1,1):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
        html_source = driver.page_source
        data = html_source.encode("utf-8")
My next step was to extract specific information from the page, such as the price.
So I added the following code:
all_spans = driver.find_elements_by_xpath("/html/body/div[5]/div/div[3]/div[2]/div[2]/div[1]/div[1]/div")
print(all_spans)
for price in all_spans:
    Header = driver.find_elements_by_xpath("/html/body/div[5]/div/div[3]/div[2]/div[2]/div[1]/div[1]/div/div[2]/div[2]/span[2]")
    for span in Header:
        print(span.text)
But I get just one price instead of all of them. Could you give me feedback on how I could improve my code? Thanks :)
EDIT
Thanks to you guys, I managed to get it running. Here is the additional code:
elements = driver.find_elements_by_xpath("//div[@id='productList']/div/div")
innerElements = 15
outerElements = len(elements)/innerElements
print(innerElements, "\t", outerElements, "\t", len(elements))
for j in range(1, int(outerElements)):
    for i in range(1, int(innerElements)):
        headline = driver.find_element_by_xpath("//div[@id='productList']/div["+str(j)+"]/div["+str(i)+"]/div/div[2]/h2/a").text
        price = driver.find_element_by_xpath("//div[@id='productList']/div["+str(j)+"]/div["+str(i)+"]/div/div[2]/div[2]/span[2]").text
        deeplink = driver.find_element_by_xpath("//div[@id='productList']/div["+str(j)+"]/div["+str(i)+"]/div/div[2]/h2/a").get_attribute("href")
        print("Header: " + headline + " | " + "Price: " + price + " | " + "Deeplink: " + deeplink)
My last issue is that I still do not get the last 20 prices, which belong to items with an English description. Only the prices of items with a German description come back. The English ones are not fetched, although they share the same HTML structure.
E.g., the XPath for one of the English items:
headline = driver.find_element_by_xpath("//div[@id='productList']/div[6]/div[1]/div/div[2]/h2/a")
Do you guys know what I have to modify? Any feedback is appreciated:)
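One thing worth checking in the loops above, offered as a guess rather than a confirmed diagnosis: XPath positions start at 1, and Python's range(1, n) stops at n - 1, so both loops skip their last position. If the page held six groups of 15 items (an assumption; the real counts are not given in the question, although the div[6] in the example fits that picture), the loops would skip all of div[6] plus the 15th item of each of the first five groups, which is 20 items. The hypothetical helper xpath_positions below sketches the inclusive version:

```python
def xpath_positions(outer_count, inner_count):
    """Yield every (outer, inner) pair of 1-based XPath positions."""
    # range's upper bound is exclusive, so add 1 to include the last position.
    for outer in range(1, outer_count + 1):
        for inner in range(1, inner_count + 1):
            yield outer, inner


# Assumed layout for illustration only: 6 groups of 15 items each.
visited_exclusive = [(j, i) for j in range(1, 6) for i in range(1, 15)]
visited_inclusive = list(xpath_positions(6, 15))
print(len(visited_exclusive), len(visited_inclusive))  # 70 90
```

In the posted loops this would amount to range(1, int(outerElements) + 1) and range(1, int(innerElements) + 1).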
With Selenium WebDriver, an element can be located using an XPath locator. The general form of the expression is //tagname[@attribute='value'], for example //tagname[@class='value'] to match on the class attribute. An XPath can be absolute (a full path from the root, such as /html/body/div[5]/...) or relative (starting with //, which matches at any depth).
In Selenium's Java bindings, findElement(By.xpath(...)) returns the first element that matches the XPath locator passed to it, and findElements(By.xpath(...)) returns a collection of all elements that match. The Python counterparts used in the question are find_element_by_xpath and find_elements_by_xpath.
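To grab all prices on that page, use an XPath like this:
Header = driver.find_elements_by_xpath("//span[contains(concat(' ', normalize-space(@class), ' '), ' price-amount ')]")
This finds every span element whose class list contains price-amount. The expression is this verbose because @class can hold several space-separated names: padding both the attribute value and the name with spaces matches price-amount as a whole class name rather than as a substring.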
A simpler way to find the same elements is a CSS selector:
.price-amount