
Selenium + Flask/Falcon in Python - 502 Bad Gateway Error

I'm using Selenium to scrape a website headlessly inside an endpoint of an API built with Flask for Python. I ran several tests, and my Selenium scraping code works perfectly both as a standalone script and when the API runs on localhost. However, when I deploy the code to a remote server, the requests always return a 502 Bad Gateway error. This is odd because the logs show that the scraping works correctly, but the server responds with 502 before the scraping finishes, as if it were trying to set up a proxy and failing. I also noticed that removing the time.sleep from my code makes it return a 200, although the result could be wrong because Selenium doesn't get enough time to load the whole page before scraping.

I also tried Falcon instead of Flask and got a similar error. This is a sample of my recent code using Falcon:

class GetUrl(object):

    def on_get(self, req, resp):
        """
        Get Request
        :param req:
        :param resp:
        :return:
        """

        # read parameter
        req_body = req.bounded_stream.read()
        json_data = json.loads(req_body.decode('utf8'))
        url = json_data.get("url")

        # get the url
        options = Options()
        options.add_argument("--headless")
        driver = webdriver.Firefox(firefox_options=options)

        driver.get(url)
        time.sleep(5)
        result = False

        # check for outbound links
        content = driver.find_elements_by_xpath("//a[@class='_52c6']")
        if len(content) > 0:
            href = content[0].get_attribute("href")
            result = True

        driver.quit()

        # make the return
        return_doc = {"result": result}
        resp.body = json.dumps(return_doc, sort_keys=True, indent=2)
        resp.content_type = 'text/string'
        resp.append_header('Access-Control-Allow-Origin', "*")
        resp.status = falcon.HTTP_200

I saw some similar issues, but even though I can see that gunicorn is running on my server, I don't have nginx, or at least it is not running where it should be. And I don't think Falcon uses it. So, what exactly am I doing wrong? Any light on this issue is highly appreciated, thank you!

Thiago asked Sep 03 '21 02:09




1 Answer

This might work:

# Notebook setup (IPython/Colab): lines starting with "!" are shell commands.
!apt-get update
!apt install chromium-chromedriver
!which chromedriver
!pip install selenium falcon page_objects

import json
import time

import falcon  # needed for falcon.HTTP_200 in the class below
from IPython.display import clear_output
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from page_objects import PageObject, PageElement

time.sleep(1)
clear_output()

class GetUrl(object):

    def on_get(self, req, resp):
        """
        Get Request
        :param req:
        :param resp:
        :return:
        """

        # read parameter
        req_body = req.bounded_stream.read()
        json_data = json.loads(req_body.decode('utf8'))
        url = json_data.get("url")  # look up the "url" key, not the URL itself

        # get the url
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        driver = webdriver.Chrome(options=options)  # chromedriver is looked up on PATH
        driver.implicitly_wait(3)

        driver.get(url)
        result = False

        # check for outbound links
        contentStorage = []
        content = driver.find_elements(By.TAG_NAME, 'a')  # find_elements_by_* is gone in current Selenium 4
        for i in content:
            contentStorage.append(i.get_attribute('text'))
            result = True

        driver.quit()  # close the browser, otherwise every request leaves a Chrome process behind

        # make the return
        return_doc = {"result": result}
        resp.body = json.dumps(return_doc, sort_keys=True, indent=2)
        resp.content_type = 'application/json'  # the body is JSON; 'text/string' is not a registered media type
        resp.append_header('Access-Control-Allow-Origin', "*")
        resp.status = falcon.HTTP_200
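
If anything raises between creating the driver and calling driver.quit(), the browser is never closed. A minimal sketch of one way to guarantee cleanup, assuming a hypothetical helper named managed_driver (it is not part of Selenium or Falcon) that accepts any zero-argument factory returning an object with a quit() method. FakeDriver is only a stand-in so the sketch runs without a browser:

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs whether the body succeeded or raised.
        driver.quit()


class FakeDriver:
    """Stand-in for a Selenium driver (hypothetical, for illustration)."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as driver:
        raise RuntimeError("scrape failed")  # simulate a failing scrape
except RuntimeError:
    pass

print(fake.closed)  # the driver was closed despite the exception
```

Inside on_get this would presumably read `with managed_driver(lambda: webdriver.Chrome(options=options)) as driver:`, with the scraping code indented under it.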

Note that I tested it as a plain script rather than inside the class, and that it uses Chrome instead of Firefox:

# Notebook setup (IPython/Colab): lines starting with "!" are shell commands.
!apt-get update
!apt install chromium-chromedriver
!which chromedriver
!pip install selenium page_objects

import time

from IPython.display import clear_output
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from page_objects import PageObject, PageElement

time.sleep(1)
clear_output()
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=options)  # chromedriver is looked up on PATH
driver.implicitly_wait(3)
driver.get('https://stackoverflow.com/questions/69038958/selenium-flask-falcon-in-python-502-bad-gateway-error/69546175#69546175')
content = driver.find_elements(By.TAG_NAME, 'a')  # find_elements_by_* is gone in current Selenium 4
contentStorage = []
for i in content:
    contentStorage.append(i.get_attribute('text'))
driver.quit()  # close the browser once the link texts are collected
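
The question relies on a fixed time.sleep(5) to let the page load, which always costs five seconds and still gives no guarantee. A minimal sketch of polling instead, assuming a hypothetical helper named wait_for whose condition argument is any zero-argument callable. Selenium ships the same idea as WebDriverWait, which the imports above already pull in. The fake_find_links function only simulates a page whose links appear on the third poll:

```python
import time


def wait_for(condition, timeout=10.0, poll=0.25):
    """Call condition() until it returns a truthy value or timeout elapses.

    Returns the last value from condition(), so an empty list means
    nothing showed up in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value or time.monotonic() >= deadline:
            return value
        time.sleep(poll)


# Simulated page (hypothetical): the links "appear" on the third poll.
calls = {"n": 0}


def fake_find_links():
    calls["n"] += 1
    return ["https://example.com"] if calls["n"] >= 3 else []


links = wait_for(fake_find_links, timeout=2.0, poll=0.01)
missing = wait_for(lambda: [], timeout=0.05, poll=0.01)
print(links, missing)
```

With a real driver, the condition could be something like `lambda: driver.find_elements(By.XPATH, "//a[@class='_52c6']")`, so the handler returns as soon as the links exist and gives up after the timeout.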
Ori Yarden answered Oct 14 '22 05:10