I'm using selenium to make a headless scraping of a website within an endpoint of an API using Flask for Python. I made several tests and my selenium scraping code works perfectly within a script and while running as an API in the localhost. However, when I deploy the code in a remote server, the requests always return a 502 Bad Gateway error. It is weird because by logging I can see that the scraping is working correctly, but the server responds with 502 before the scraping finish processing, as if it was trying to set up a proxy and it fails. I also noticed that removing the time.sleep
in my code makes it return a 200 although the result could be wrong because it doesn't give selenium the proper time to load the all the page to scrape.
I also tried to set up to use falcon instead of flask and I get a similar error. This is a sample of my recent code using Falcon:
class GetUrl(object):
def on_get(self, req, resp):
"""
Get Request
:param req:
:param resp:
:return:
"""
# read parameter
req_body = req.bounded_stream.read()
json_data = json.loads(req_body.decode('utf8'))
url = json_data.get("url")
# get the url
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(firefox_options=options)
driver.get(url)
time.sleep(5)
result = False
# check for outbound links
content = driver.find_elements_by_xpath("//a[@class='_52c6']")
if len(content) > 0:
href = content[0].get_attribute("href")
result = True
driver.quit()
# make the return
return_doc = {"result": result}
resp.body = json.dumps(return_doc, sort_keys=True, indent=2)
resp.content_type = 'text/string'
resp.append_header('Access-Control-Allow-Origin', "*")
resp.status = falcon.HTTP_200
I saw some other similar issues like this, but even though I can see that there is a gunicorn
running in my server, I don't have nginx, or at least it is not running where it should running. And I don't think Falcon uses it. So, what exactly am I doing wrong? Some light in this issue is highly appreciated, thank you!
A 502 Bad Gateway Error is a general indicator that there's something wrong with a website's server communication. Since it's just a generic error, it doesn't actually tell you the website's exact issue. When this happens, your website will serve an error web page to your site's visitors, like the photo below.
The 502 Bad Gateway error is an HTTP status code that means that one server on the internet received an invalid response from another server. 502 Bad Gateway errors are completely independent of your particular setup, meaning that you could see one in any browser, on any operating system, and on any device.
This might work:
from IPython.display import clear_output
import time as time
import json
!apt-get update
!apt install chromium-chromedriver
!which chromedriver
!pip install selenium
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.expected_conditions import presence_of_element_located
!pip install page_objects
import page_objects
from page_objects import PageObject, PageElement
time.sleep(1)
clear_output()
class GetUrl(object):
def on_get(self, req, resp):
"""
Get Request
:param req:
:param resp:
:return:
"""
# read parameter
req_body = req.bounded_stream.read()
json_data = json.loads(req_body.decode('utf8'))
url = json_data.get("https://stackoverflow.com/questions/69038958/selenium-flask-falcon-in-python-502-bad-gateway-error/69546175#69546175")
# get the url
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',options = options)
driver.implicitly_wait(3)
driver.get("https://stackoverflow.com/questions/69038958/selenium-flask-falcon-in-python-502-bad-gateway-error/69546175#69546175")
result = False
# check for outbound links
contentStorage = []
content = driver.find_elements_by_tag_name('a')
for i in content:
contentStorage.append(i.get_attribute('text'))
result = True
#driver.quit()
# make the return
return_doc = {"result": result}
resp.body = json.dumps(return_doc, sort_keys=True, indent=2)
resp.content_type = 'text/string'
resp.append_header('Access-Control-Allow-Origin', "*")
resp.status = falcon.HTTP_200
However, I was testing it without using a class object, and also it's using Chrome instead of FireFox:
from IPython.display import clear_output
import time as time
!apt-get update
!apt install chromium-chromedriver
!which chromedriver
!pip install selenium
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.expected_conditions import presence_of_element_located
!pip install page_objects
import page_objects
from page_objects import PageObject, PageElement
time.sleep(1)
clear_output()
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',options = options)
driver.implicitly_wait(3)
driver.get('https://stackoverflow.com/questions/69038958/selenium-flask-falcon-in-python-502-bad-gateway-error/69546175#69546175')
content = driver.find_elements_by_tag_name('a')
contentStorage = []
for i in content:
contentStorage.append(i.get_attribute('text'))
#driver.quit()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With