Selenium find all elements by xpath

Tags:

python

selenium

web-crawler

I used Selenium to scrape a scrolling website and wrote the code below:

import requests
from bs4 import BeautifulSoup
import csv
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import unittest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import re

output_file = open("Kijubi.csv", "w", newline='')  

class Crawling(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.set_window_size(1024, 768)
        self.base_url = "http://www.viatorcom.de/"
        self.accept_next_alert = True

    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get(self.base_url + "de/7132/Seoul/d973-allthingstodo")
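        # NOTE: range(1, 1) is empty, so as written this loop never scrolls;
        # widen the range to actually trigger the lazy loading.
        for i in range(1,1):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
        html_source = driver.page_source
        data = html_source.encode("utf-8")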

My next step was to extract specific information from the website, such as the price.

Hence, I added the following code:

    all_spans = driver.find_elements_by_xpath("/html/body/div[5]/div/div[3]/div[2]/div[2]/div[1]/div[1]/div")
    print(all_spans)
    for price in all_spans:
        Header = driver.find_elements_by_xpath("/html/body/div[5]/div/div[3]/div[2]/div[2]/div[1]/div[1]/div/div[2]/div[2]/span[2]")
        for span in Header:
            print(span.text)

But I get just one price instead of all of them. Could you give me feedback on how I could improve my code? Thanks :)

EDIT

Thanks to you guys, I managed to get it running. Here is the additional code:

    elements = driver.find_elements_by_xpath("//div[@id='productList']/div/div")
    innerElements = 15
    outerElements = len(elements)/innerElements
    print(innerElements, "\t", outerElements, "\t", len(elements))

    for j in range(1, int(outerElements)):
        for i in range(1, int(innerElements)):
            headline = driver.find_element_by_xpath("//div[@id='productList']/div["+str(j)+"]/div["+str(i)+"]/div/div[2]/h2/a").text
            price = driver.find_element_by_xpath("//div[@id='productList']/div["+str(j)+"]/div["+str(i)+"]/div/div[2]/div[2]/span[2]").text
            deeplink = driver.find_element_by_xpath("//div[@id='productList']/div["+str(j)+"]/div["+str(i)+"]/div/div[2]/h2/a").get_attribute("href")

            print("Header: " + headline + " | " + "Price: " + price + " | " + "Deeplink: " + deeplink)

Now my last issue is that I still do not get the last 20 prices, which have an English description. I only get the prices that have a German description. The English ones are not fetched, although they share the same HTML structure.

For example, this is the XPath for one of the English items:

    headline = driver.find_element_by_xpath("//div[@id='productList']/div[6]/div[1]/div/div[2]/h2/a")

Do you guys know what I have to modify? Any feedback is appreciated:)

asked Aug 17 '15 15:08

Serious Ruffy

1 Answer

To grab all prices on that page, use an XPath like this:

Header = driver.find_elements_by_xpath("//span[contains(concat(' ', normalize-space(@class), ' '), ' price-amount ')]")

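This means: find all span elements whose class is price-amount. The expression looks complex because a class attribute can hold several space-separated names with irregular whitespace, so it is normalized before matching.

A simpler way to find the same elements is a CSS selector: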
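
The class-matching expression can be wrapped in a small helper. This is only a minimal sketch: `xpath_for_class` is a hypothetical name, not part of Selenium. It pads both the normalized attribute and the class name with spaces, which is the usual form of this idiom, so that only a whole class token matches rather than any substring.

```python
def xpath_for_class(tag: str, class_name: str) -> str:
    """Build an XPath matching <tag> elements that carry class_name as a whole token.

    Hypothetical helper for illustration; it only assembles a string.
    """
    # Reject input that would break the single-quoted XPath literal
    # or that is not a single class token.
    if not class_name or "'" in class_name or any(ch.isspace() for ch in class_name):
        raise ValueError("class_name must be one class token without quotes")
    return (
        f"//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        f"' {class_name} ')]"
    )


price_xpath = xpath_for_class("span", "price-amount")
print(price_xpath)
```

The resulting string can be passed to `find_elements_by_xpath` as in the answer; newer Selenium releases use `driver.find_elements(By.XPATH, price_xpath)` instead.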

.price-amount
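
The class-based lookup can also replace the index-based loops from the question's EDIT. Those loops use `range(1, n)`, which stops at `n - 1`, so the last group and the last item of each group are never queried; that may be why some items go missing. The sketch below iterates over the product containers directly and searches relative to each one. It rests on assumptions: `collect_products` is a hypothetical name, the container XPath is the one from the question, and `h2 a` is assumed to reach the headline link inside a container. The strings `"xpath"` and `"css selector"` are the values behind `By.XPATH` and `By.CSS_SELECTOR` in Selenium's Python bindings, used with the `find_elements(by, value)` call.

```python
def collect_products(driver):
    """Return (headline, price, link) tuples for every product container found."""
    # One container per product; this XPath comes from the question.
    cards = driver.find_elements("xpath", "//div[@id='productList']/div/div")
    rows = []
    for card in cards:
        # Search relative to the container, so no positional indices are needed.
        links = card.find_elements("css selector", "h2 a")
        prices = card.find_elements("css selector", ".price-amount")
        if not links or not prices:
            continue  # skip containers that are not complete product entries
        rows.append(
            (
                links[0].text.strip(),
                prices[0].text.strip(),
                links[0].get_attribute("href"),
            )
        )
    return rows
```

Called as `collect_products(driver)` after the page has been scrolled, this returns one tuple per product regardless of how many groups the list is split into.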

answered Nov 09 '22 12:11

Viktor Chmel
