
Unable to use proxies one by one until there is a valid response

I've written a script in Python's Scrapy to make proxied requests using one of the proxies newly generated by the get_proxies() method. I used the requests module to fetch the proxies so that I can reuse them in the script. However, the problem is that the proxy my script chooses is not always a good one, so sometimes it doesn't fetch a valid response.

How can I let my script keep trying with different proxies until there is a valid response?

My script so far:

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.http.request import Request
from scrapy.crawler import CrawlerProcess

class ProxySpider(scrapy.Spider):
    name = "sslproxies"
    check_url = "https://stackoverflow.com/questions/tagged/web-scraping"
    proxy_link = "https://www.sslproxies.org/"

    def start_requests(self):
        proxylist = self.get_proxies()
        random.shuffle(proxylist)
        proxy_ip_port = next(cycle(proxylist))
        print(proxy_ip_port)       #Checking out the proxy address
        request = scrapy.Request(self.check_url, callback=self.parse,errback=self.errback_httpbin,dont_filter=True)
        request.meta['proxy'] = "http://{}".format(proxy_ip_port)
        yield request

    def get_proxies(self):   
        response = requests.get(self.proxy_link)
        soup = BeautifulSoup(response.text,"lxml")
        proxy = [':'.join([item.select_one("td").text,item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
        return proxy

    def parse(self, response):
        print(response.meta.get("proxy"))  #Compare this to the earlier one whether they both are the same

    def errback_httpbin(self, failure):
        print("Failure: "+str(failure))

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0', 
        'DOWNLOAD_TIMEOUT' : 5,  
    })
    c.crawl(ProxySpider)
    c.start()

P.S. My intention is to find a solution that stays close to the approach I've started here.

asked Feb 21 '19 by robots.txt


2 Answers

As we know, an HTTP response needs to pass through all middlewares in order to reach the spider's methods.

That means only requests made with valid proxies can proceed to the spider's callback functions.

In order to use valid proxies, we need to check all proxies first and only then choose from the ones that turned out to be valid.

When our previously chosen proxy stops working, we mark it as invalid and choose a new one from the remaining valid proxies in the spider's errback.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http.request import Request

class ProxySpider(scrapy.Spider):
    name = "sslproxies"
    check_url = "https://stackoverflow.com/questions/tagged/web-scraping"
    proxy_link = "https://www.sslproxies.org/"
    current_proxy = ""
    proxies = {}

    def start_requests(self):
        yield Request(self.proxy_link,callback=self.parse_proxies)

    def parse_proxies(self,response):

        for row in response.css("table#proxylisttable tbody tr"):
             if "yes" in row.extract():
                 td = row.css("td::text").extract()
                 self.proxies["http://{}".format(td[0]+":"+td[1])]={"valid":False}

        for proxy in self.proxies.keys():
             yield Request(self.check_url,callback=self.parse,errback=self.errback_httpbin,
                           meta={"proxy":proxy,
                                 "download_slot":proxy},
                           dont_filter=True)

    def parse(self, response):
        if "proxy" in response.request.meta.keys():
            #As script reaches this parse method we can mark current proxy as valid
            self.proxies[response.request.meta["proxy"]]["valid"] = True
            print(response.meta.get("proxy"))
            if not self.current_proxy:
                #Scraper reaches this code line on first valid response
                self.current_proxy = response.request.meta["proxy"]
                #yield Request(next_url, callback=self.parse_next,
                #              meta={"proxy":self.current_proxy,
                #                    "download_slot":self.current_proxy})

    def errback_httpbin(self, failure):
        if "proxy" in failure.request.meta.keys():
            proxy = failure.request.meta["proxy"]
            if proxy == self.current_proxy:
                #If current proxy after our usage becomes not valid
                #Mark it as not valid
                self.proxies[proxy]["valid"] = False
                for ip_port in self.proxies.keys():
                    #And choose valid proxy from self.proxies
                    if self.proxies[ip_port]["valid"]:
                        failure.request.meta["proxy"] = ip_port
                        failure.request.meta["download_slot"] = ip_port
                        self.current_proxy = ip_port
                        return failure.request
        print("Failure: "+str(failure))

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'COOKIES_ENABLED': False,
        'DOWNLOAD_TIMEOUT' : 10,
        'DOWNLOAD_DELAY' : 3,
    })
    c.crawl(ProxySpider)
    c.start()
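
To keep scraping with the working proxy, the commented-out block inside parse can be filled in. A possible sketch is shown below; next_url and parse_next are placeholders for whatever pages you actually want to crawl, not part of the answer above:

    def parse(self, response):
        if "proxy" in response.request.meta.keys():
            self.proxies[response.request.meta["proxy"]]["valid"] = True
            if not self.current_proxy:
                self.current_proxy = response.request.meta["proxy"]
                # next_url is a placeholder for the page you really want to scrape
                next_url = "https://stackoverflow.com/questions/tagged/python"
                yield scrapy.Request(next_url, callback=self.parse_next,
                                     meta={"proxy": self.current_proxy,
                                           "download_slot": self.current_proxy})

    def parse_next(self, response):
        # Placeholder callback for pages fetched through the chosen proxy
        self.logger.info("Fetched %s via %s", response.url, response.meta.get("proxy"))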
answered Nov 01 '22 by Georgiy


You need to write a downloader middleware that installs a process_exception hook; Scrapy calls this hook when an exception is raised during the download. In the hook you can return a new Request object, with the dont_filter=True flag set, so that Scrapy reschedules the request until it succeeds.

In the meantime, you can verify the response more thoroughly in a process_response hook: check the status code, the response content, etc., and reschedule the request as necessary.
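
A minimal sketch of such a middleware follows. It assumes the proxy pool comes from a hypothetical PROXY_LIST setting (it could equally be filled by something like your get_proxies()), and the status codes treated as failures are only examples; this is an illustration, not the linked project's implementation:

import random

class RotateProxyMiddleware:

    def __init__(self, proxies):
        # proxies is a list of "http://ip:port" strings
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical setting holding the proxy pool
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def _retry_with_new_proxy(self, request, spider):
        # Copy the request, switch the proxy, and let the scheduler accept the duplicate URL
        new_request = request.replace(dont_filter=True)
        new_request.meta['proxy'] = random.choice(self.proxies)
        spider.logger.info("Retrying %s via %s", new_request.url, new_request.meta['proxy'])
        return new_request

    def process_exception(self, request, exception, spider):
        # Called when the download fails (timeout, connection refused, ...);
        # returning a Request makes Scrapy schedule it instead of propagating the error
        return self._retry_with_new_proxy(request, spider)

    def process_response(self, request, response, spider):
        # Reschedule on responses that look like a dead or banned proxy, pass the rest through
        if response.status in (403, 407, 503):
            return self._retry_with_new_proxy(request, spider)
        return response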

To change the proxy easily, use the built-in HttpProxyMiddleware instead of tinkering with the environment:

request.meta['proxy'] = proxy_address
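
The custom middleware then has to be enabled through DOWNLOADER_MIDDLEWARES; the import path, the priority number, and the example proxies below are placeholders:

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'DOWNLOAD_TIMEOUT': 5,
    'DOWNLOADER_MIDDLEWARES': {
        # hypothetical import path for the sketch above; 543 is an arbitrary priority
        'myproject.middlewares.RotateProxyMiddleware': 543,
    },
    # hypothetical setting read by the middleware's from_crawler()
    'PROXY_LIST': ['http://1.2.3.4:3128', 'http://5.6.7.8:8080'],
})
c.crawl(ProxySpider)
c.start()

The built-in HttpProxyMiddleware, which applies request.meta['proxy'], is enabled by default, so it does not need to be listed here.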

Take a look at this project as an example.

answered Nov 01 '22 by georgexsh