
Scraping headlines from a news website with infinite loading

I want to scrape headlines from this website: https://www.marketwatch.com/latest-news?mod=top_nav

I need to load earlier news, so clicking the blue "SEE MORE" button is necessary.

I wrote this code, but it didn't work:

from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
u = 'https://www.marketwatch.com/latest-news?mod=top_nav' #US Business


driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe")
driver.maximize_window()
driver.get(u)
time.sleep(10)
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CLASS_NAME,'close-btn'))).click()
time.sleep(10)

driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
for i in range(3):
        element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR,'component.component--module.more-headlines div.group.group--buttons.cover > a.btn.btn--secondary.js--more-headlines')))
        driver.execute_script("arguments[0].scrollIntoView();", element)
        element.click()
        time.sleep(5)
        driver.execute_script("arguments[0].scrollIntoView();", element)

        print(f'click {i} done')
soup = BeautifulSoup(driver.page_source, 'html.parser')

driver.quit()

It returns this error:

raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
khaled koubaa asked Dec 02 '20 07:12


1 Answer

Something like this will be more reliable:

for i in range(3):
  driver.execute_script('''
    document.querySelector('a.js--more-headlines').click()
  ''')
  time.sleep(1)

Note that you don't have to scroll the element into view when you click it from JavaScript.
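If the button can disappear once there is nothing left to load, the loop above raises a JavaScript error because `querySelector` returns null. A minimal sketch of a guarded version follows. It assumes `driver` is a Selenium WebDriver that has already loaded the page; `click_more` and its parameters are hypothetical names chosen for illustration, and the default selector is the one used above.

```python
import time

# JavaScript run in the page: click the first element matching the selector
# passed as arguments[0], and report whether such an element was found.
CLICK_JS = """
const btn = document.querySelector(arguments[0]);
if (!btn) { return false; }
btn.click();
return true;
"""


def click_more(driver, times=3, selector="a.js--more-headlines", pause=1.0):
    """Click the matching element up to `times` times.

    Stops early if the element is no longer in the page.
    Returns the number of clicks actually performed.
    """
    clicks = 0
    for _ in range(times):
        # execute_script forwards extra arguments to the script as arguments[0], ...
        if not driver.execute_script(CLICK_JS, selector):
            break
        clicks += 1
        time.sleep(pause)  # crude wait for the next batch of headlines to load
    return clicks
```

It could be called as `click_more(driver, times=3)` right before building the soup from `driver.page_source`. The fixed pause is only a guess at how long the site takes to load; a slower connection may need a larger value.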

pguardiario answered Nov 16 '22 03:11