
How to access a site via a headless driver without being denied permission

I am trying to retrieve the HTML of a site using a headless Chrome driver. However, I get an "Access Denied" page instead. If I use a "regular" (non-headless) driver, it all works fine.

Is there any way to bypass that?

It's my first post, so I apologize for any formatting mistakes.

from selenium import webdriver

# Headless driver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')

driver1 = webdriver.Chrome(
    executable_path='./chromedriver',
    options=chrome_options,
    service_args=['--verbose', '--log-path=/tmp/chromedriver.log'],
)

driver1.get('https://www.size.co.uk/')
html = driver1.page_source
html

The message I get is:

<html xmlns="http://www.w3.org/1999/xhtml"><head>\n<title>Access Denied</title>\n</head><body>\n<h1>Access Denied</h1>\n \nYou don\'t have permission to access "http://www.size.co.uk/" on this server.<p>\nReference #18.ac81655f.1548818550.73b12da\n\n\n</p></body></html>

Regular driver:

driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.size.co.uk/')
html = driver.page_source
driver.quit()
html

Ideally, I'd like the output to be the same as in the latter case, without new windows popping up every couple of seconds.

Michal B. asked Dec 08 '22 12:12


1 Answer

Adding the following snippet to the headless setup, before the driver is created, got the page to load for me:

user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'    
chrome_options.add_argument('user-agent={0}'.format(user_agent))

The site appears to be checking the user agent for headless browsers and denying them access. Here's an article on avoiding detection: Making Chrome Headless Undetectable
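Putting the question's setup and the user-agent override together might look roughly like the sketch below. The helper names `build_headless_args` and `make_headless_driver` are made up for illustration, and the `executable_path` argument simply mirrors the call in the question; newer Selenium releases expect a `Service` object instead, so treat that line as an assumption about the installed version.

```python
def build_headless_args(user_agent):
    """Return the Chrome arguments used above, plus a user-agent override."""
    return [
        '--headless',
        '--no-sandbox',
        'user-agent={0}'.format(user_agent),
    ]


def make_headless_driver(user_agent, driver_path='./chromedriver'):
    """Hypothetical helper: create a headless Chrome driver with a custom user agent."""
    # Imported here so build_headless_args can be used without Selenium installed.
    from selenium import webdriver

    chrome_options = webdriver.ChromeOptions()
    for arg in build_headless_args(user_agent):
        chrome_options.add_argument(arg)

    # Assumption: same call style as the question (older Selenium releases).
    return webdriver.Chrome(executable_path=driver_path, options=chrome_options)
```

Calling `make_headless_driver(user_agent)` and then `driver.get('https://www.size.co.uk/')` should behave like the headless example in the question, only with the overridden user agent. Whether that is enough depends on what else the site checks.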

To see which user agent the driver is actually sending, you can run the following:

driver.execute_script("return navigator.userAgent")

Chrome's headless user agent looks something like this:

u'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/71.0.3578.98 Safari/537.36'
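Instead of hard-coding a user agent, one option is to take the string the headless driver reports and drop the `HeadlessChrome` token, so the version still matches the installed browser. This is only a sketch: `strip_headless_token` is a hypothetical helper, and it assumes the site looks at nothing beyond the user-agent string.

```python
def strip_headless_token(user_agent):
    """Replace the 'HeadlessChrome/' product token with plain 'Chrome/'."""
    return user_agent.replace('HeadlessChrome/', 'Chrome/')


# The headless user agent quoted above.
headless_ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) HeadlessChrome/71.0.3578.98 Safari/537.36')

regular_ua = strip_headless_token(headless_ua)
```

The resulting `regular_ua` can then be passed to `chrome_options.add_argument('user-agent={0}'.format(regular_ua))` when creating a new headless driver.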

cullzie answered Jan 05 '23 00:01