I'm not sure why, but my script always stops crawling once it hits page 9. There are no errors, exceptions, or warnings, so I'm kind of at a loss.
Can somebody help me out?
P.S. Here is the full script in case anybody wants to test it for themselves!
def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        print(len(items))
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    break
            if count+1 is len(items):
                try:
                    next_button = WebDriverWait(ff, 15).until(
                        EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                    )
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    print(error)
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

initiate_crawl()
Printing the length of items invokes some strange behaviour too. Instead of always returning 32, which would correspond to the number of items on each page, it prints 32 for the first page, 64 for the second, 96 for the third, and so on. I fixed this by using //div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")] instead of //div[contains(@id, "100_dealView_")] as the XPath for the items variable. I'm hoping this is the reason it runs into issues on page 9. I'm running tests right now. Update: it is now scraping page 10 and beyond, so the issue is resolved.
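A plausible reason for the silent stop (my reading, not confirmed anywhere in the thread): the check count+1 is len(items) compares integer identity rather than equality. CPython happens to cache small integers (-5 through 256), so `is` behaves like == only while len(items) stays at or below 256 — and with the original accumulating XPath, len(items) first exceeds 256 on page 9 (9 × 32 = 288), making the check fail without any error. A minimal sketch:

```python
def is_last_item(count, length):
    # Mirrors the script's buggy check: identity (`is`), not equality (`==`).
    return (count + 1) is length

# CPython caches small ints, so identity "works" by accident:
print(is_last_item(31, 32))    # True: 32 is a cached small int

# Beyond 256 the two int objects are distinct, so the check silently fails:
print(is_last_item(287, 288))  # False in CPython: 288 is not cached
```

With == the comparison holds for any value, so either narrowing the XPath (as done above) or replacing `is` with `==` avoids the silent stop.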
As per your 10th revision of this question, the error message...
HTTPConnectionPool(host='127.0.0.1', port=58992): Max retries exceeded with url: /session/e8beed9b-4faa-4e91-a659-56761cb604d7/element (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000022D31378A58>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
...implies that the get() method failed, raising an HTTPConnectionPool error with the message Max retries exceeded.
A couple of things:
As per the discussion in max retries exceeded exceptions are confusing, the traceback is somewhat misleading: Requests wraps the exception for the user's convenience, and the original exception is part of the message displayed.
Requests never retries (it sets retries=0 for urllib3's HTTPConnectionPool), so the error would have been much more canonical without the MaxRetryError and HTTPConnectionPool keywords. An ideal traceback would have been:
NewConnectionError(<class 'socket.error'>: [Errno 10061] No connection could be made because the target machine actively refused it)
You will find a detailed explanation in MaxRetryError: HTTPConnectionPool: Max retries exceeded (Caused by ProtocolError('Connection aborted.', error(111, 'Connection refused')))
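To see that wrapping in action, here is a small sketch (assuming the requests library is installed) that connects to a local port with no listener: requests raises a ConnectionError whose message embeds urllib3's original MaxRetryError/NewConnectionError text, just as described above.

```python
import socket
import requests

# Find a port with nothing listening: bind an ephemeral port, then close it.
sock = socket.socket()
sock.bind(('127.0.0.1', 0))
dead_port = sock.getsockname()[1]
sock.close()

msg = ''
try:
    requests.get('http://127.0.0.1:%d/' % dead_port, timeout=2)
except requests.exceptions.ConnectionError as err:
    # requests wraps urllib3's MaxRetryError; the original
    # NewConnectionError text survives inside the message.
    msg = str(err)

print('Max retries exceeded' in msg)
```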
As per the Release Notes of Selenium 3.14.1:
* Fix ability to set timeout for urllib3 (#6286)
The associated merge is: repair urllib3 can't set timeout!
Once you upgrade to Selenium 3.14.1 you will be able to set the timeout, see canonical tracebacks, and take the required action.
I have taken your full script from codepen.io - A PEN BY Anthony. I had to make a few tweaks to your existing code as follows:
As you have used:
ua_string = random.choice(ua_strings)
You must therefore import random:
import random
You have created the variable next_button but never used it, so I have combined the following four lines:
next_button = WebDriverWait(ff, 15).until(
    EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
)
ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
As:
WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
Your modified code block will be:
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time
import random

""" Set Global Variables
"""
ua_strings = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36']
already_scraped_product_titles = []

""" Create Instances of WebDriver
"""
def create_webdriver_instance():
    ua_string = random.choice(ua_strings)
    profile = webdriver.FirefoxProfile()
    profile.set_preference('general.useragent.override', ua_string)
    options = Options()
    options.add_argument('--headless')
    # pass both the profile and the options so --headless actually takes effect
    return webdriver.Firefox(firefox_profile=profile, options=options)

""" Construct List of UA Strings
"""
def fetch_ua_strings():
    ff = create_webdriver_instance()
    ff.get('https://techblog.willshouse.com/2012/01/03/most-common-user-agents/')
    ua_strings_ff_eles = ff.find_elements(By.XPATH, '//td[@class="useragent"]')
    for ua_string in ua_strings_ff_eles:
        if 'mobile' not in ua_string.text and 'Trident' not in ua_string.text:
            ua_strings.append(ua_string.text)
    ff.quit()

""" Log in to Amazon to Use SiteStripe in order to Generate Affiliate Links
"""
def log_in(ff):
    ff.find_element(By.XPATH, '//a[@id="nav-link-yourAccount"] | //a[@id="nav-link-accountList"]').click()
    ff.find_element(By.ID, 'ap_email').send_keys('[email protected]')
    ff.find_element(By.ID, 'continue').click()
    ff.find_element(By.ID, 'ap_password').send_keys('lo0kyLoOkYig0t4h')
    ff.find_element(By.NAME, 'rememberMe').click()
    ff.find_element(By.ID, 'signInSubmit').click()

""" Build Lists of Product Page URLs
"""
def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            # For Groups of Items on Sale
            # active_deals = //*[contains(text(), "Add to Cart") or contains(text(), "View Deal")]
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    # Scrape Details of Each Deal
                    # extract(ff, item.find_element(By.ID, 'dealImage').get_attribute('href'))
                    print(product_title[:10])
                    ff.quit()
                    refresh_page(url)
                    break
            if count+1 == len(items):  # == rather than is: identity comparison fails for ints above 256
                try:
                    print('')
                    print('new page')
                    WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                    time.sleep(10)
                    url = ff.current_url
                    print(url)
                    print('')
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    """
                    ff.find_element(By.XPATH, '//*[@id="pagination-both-004143081429407891"]/ul/li[9]/a').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    """
                    print('cannot find ff.find_element(By.PARTIAL_LINK_TEXT, "Next→")')
                    print('Because of... {}'.format(error))
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

# def extract_info(ff, url):

fetch_ua_strings()
initiate_crawl()
Console Output: With Selenium v3.14.0 and Firefox Quantum v62.0.3, I can extract the following output on the console:
J.Rosée Si
B.Catcher
Bluetooth4
FRAM G4164
Major Crim
20% off Oh
True Blood
Prime-Line
Marathon 3
True Blood
B.Catcher
4 Film Fav
True Blood
Texture Pa
Westinghou
True Blood
ThermoPro
...
...
...
Note: I could have optimized your code to perform the same web scraping operations by initializing the Firefox browser client only once and traversing through the various products and their details. But to preserve your logic and innovation, I have suggested the minimal changes required to get you through.
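As a rough illustration of that single-instance idea (my sketch, not the answer's code), the quit-and-recurse pattern can be flattened into one loop that keeps a single driver alive. Here find_next_link is a hypothetical helper standing in for looking up the 'Next→' link via find_elements; the structure is what matters.

```python
def crawl_all_pages(driver, start_url, scrape_page, max_pages=50):
    """Scrape every page with ONE driver: loop instead of quit-and-recurse."""
    driver.get(start_url)
    pages_scraped = 0
    for _ in range(max_pages):
        scrape_page(driver)                   # process the deals on the current page
        pages_scraped += 1
        next_link = driver.find_next_link()   # hypothetical helper; None on the last page
        if next_link is None:
            break
        next_link.click()                     # advance without quitting the browser
    return pages_scraped
```

Because element references go stale after a click, scrape_page should re-query the deal elements on every iteration rather than reusing ones found earlier.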