I am using Selenium on Python 3.7.2 to scrape 9gag for a school project.
I am running Chrome 80.0.3987.122 on macOS. My chromedriver version is the one offered for Chrome 80. This is how I use my driver:
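from selenium import webdriver
from selenium.webdriver.chrome.options import Options as c_opt

options = c_opt()
options.headless = True  # run Chrome without a visible window
# PATH_TO_DRIVER holds the local path to the chromedriver binary
driver = webdriver.Chrome(executable_path=PATH_TO_DRIVER, options=options)
driver.get('https://www.9gag.com')
# Save the rendered page source so it can be inspected in a browser
with open('source.html', 'w') as f:
    f.write(driver.page_source)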
Everything worked fine yesterday: I would run this code, open the source file, and see the first couple of 9gag articles. Starting this morning, the saved source shows only a loading graphic, as if the page had not finished loading its JavaScript.
I know this is not an issue with the website since I tried this again with a headless firefox driver and a non-headless chrome driver and everything worked as expected.
The driver does not show any errors as far as I can tell.
My number one suspect is Chrome. I think it may have been updated somehow, and Selenium or the driver doesn't know how to handle it. I really need to use headless mode, since without it I am forced to keep focus on the Chrome window (this may be a Mac issue, but still).
Has anyone encountered this behavior?
UPDATE
I see that my issue happens only when I visit specific categories, for example https://9gag.com/funny. So I saved the output from there, loaded it in Chrome, and got the following:
It seems that headless Chrome is running into a captcha and cannot proceed to load the page. How is it possible that this only started happening now, and is there something that can be done? How can we explain that geckodriver for Firefox somehow gets past this (it has its own issues, but at least it loads the page)?
You can try adding these two flags to your options. The first one stops JavaScript on the page from seeing navigator.webdriver set to true. Sites can read that variable to check whether you're using automation, and then block you or make you solve a captcha.
The second one sets the user agent. By default, headless Chrome identifies itself as HeadlessChrome in its user agent, which is easy for a site to detect, so set it to the user-agent string of a regular browser.
options.add_argument('disable-blink-features=AutomationControlled')
options.add_argument('user-agent=Type user agent here')
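For completeness, here is a minimal sketch of how the two flags might be wired into the driver setup from the question, plus a quick way to check what page scripts will see. The helper names (build_chrome_args, make_driver, report_fingerprint) are made up for illustration, and the Selenium 3 style constructor with executable_path is assumed because that is what the question uses. I have not verified that this gets past 9gag's captcha.

```python
def build_chrome_args(user_agent, headless=True):
    """Return the Chrome command-line switches discussed above.

    The function name and parameters are hypothetical, chosen for this sketch.
    """
    args = [
        # Keeps page scripts from seeing navigator.webdriver set to true
        '--disable-blink-features=AutomationControlled',
        # Replaces the default user agent with the one supplied by the caller
        '--user-agent=' + user_agent,
    ]
    if headless:
        args.append('--headless')
    return args


def make_driver(driver_path, user_agent):
    """Create a Chrome driver configured with the switches above."""
    # Imported lazily so build_chrome_args can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in build_chrome_args(user_agent):
        options.add_argument(arg)
    # Selenium 3 style, as in the question; Selenium 4 expects a Service object instead.
    return webdriver.Chrome(executable_path=driver_path, options=options)


def report_fingerprint(driver):
    """Return the two values a page script could read to spot automation."""
    return {
        'webdriver': driver.execute_script('return navigator.webdriver'),
        'user_agent': driver.execute_script('return navigator.userAgent'),
    }
```

Calling report_fingerprint(make_driver(PATH_TO_DRIVER, your_user_agent)) before and after adding the flags shows whether they took effect; here your_user_agent stands for whatever string you copied from a regular browser.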
Hopefully this helps.