 

Proxy Pooling System for Scrapy to temporarily stop using slow/timing out proxies

I've been looking around trying to find a decent pooling system for Scrapy but I can't find anything that has everything I need/want.

I'm looking for a solution to:

Rotate proxies

  • I'd like it to switch between proxies randomly, never selecting the same proxy twice in a row. (Scrapoxy has this)

Impersonate Known Browsers

  • Impersonate Chrome, Firefox, Internet Explorer, Edge, Safari... etc (Scrapoxy has this)

Blacklist Slow Proxies

  • If a proxy times out or is slow, it should be blacklisted according to a series of rules... (Scrapoxy only has blacklisting for number of instances/startups.)

  • If a proxy is slow (takes over x time), it should be marked as Slow, a timestamp should be taken, and a counter should be increased.

  • If a proxy times out, it should be marked as Fail, a timestamp should be taken, and a counter should be increased.
  • If a proxy has no slows for 15 minutes after its last slow, the counter & timestamp should be zeroed and the proxy returned to a fresh state.
  • If a proxy has no fails for 30 minutes after its last fail, the counter & timestamp should be zeroed and the proxy returned to a fresh state.
  • If a proxy is slow 5 times in 1 hour, it should be removed from the pool for 1 hour.
  • If a proxy times out 5 times in 1 hour, it should be blacklisted for 1 hour.
  • If a proxy gets blocked twice in 3 hours, it should be blacklisted for 12 hours and marked as bad.
  • If a proxy gets marked as bad twice in 48 hours, it should notify me (email, Pushbullet... anything).

Does anyone know of any such solution? (The main feature I need is the blacklisting of slow/timed-out proxies.)
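For reference, the counter/reset rules above could be sketched roughly like this. This is a minimal sketch, not a ready-made solution: the class, constant and method names are illustrative, and only the slow/fail counting, the 15/30-minute resets and the 1-hour ban are implemented (the block/bad escalation and notification would follow the same pattern):

```python
import time

SLOW_RESET_S = 15 * 60   # no slows for 15 min -> zero the slow counter
FAIL_RESET_S = 30 * 60   # no fails for 30 min -> zero the fail counter
WINDOW_S = 60 * 60       # 5 slows/fails inside 1 hour -> remove for 1 hour
BAN_S = 60 * 60

class ProxyHealth:
    """Tracks slow/fail events for one proxy (illustrative names)."""

    def __init__(self, clock=time.time):
        self.clock = clock          # injectable for testing
        self.slow_times = []        # timestamps of recent slow responses
        self.fail_times = []        # timestamps of recent timeouts
        self.banned_until = 0.0

    def _prune(self, stamps, reset_after, now):
        # Reset to a fresh state if the last event is older than the reset
        # window; otherwise keep only events inside the 1-hour counting window.
        if stamps and now - stamps[-1] > reset_after:
            return []
        return [t for t in stamps if now - t <= WINDOW_S]

    def record_slow(self):
        now = self.clock()
        self.slow_times = self._prune(self.slow_times, SLOW_RESET_S, now)
        self.slow_times.append(now)
        if len(self.slow_times) >= 5:      # 5 slows in 1 hour -> 1 hour ban
            self.banned_until = now + BAN_S
            self.slow_times = []

    def record_fail(self):
        now = self.clock()
        self.fail_times = self._prune(self.fail_times, FAIL_RESET_S, now)
        self.fail_times.append(now)
        if len(self.fail_times) >= 5:      # 5 timeouts in 1 hour -> 1 hour ban
            self.banned_until = now + BAN_S
            self.fail_times = []

    def usable(self):
        return self.clock() >= self.banned_until
```

The pool would then keep one `ProxyHealth` per proxy and skip any whose `usable()` returns False.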

Ryflex asked Feb 21 '18


1 Answer

Since your pooling rules are very specific, you may want to code your own. The code below implements part of your rules (you will have to implement the rest):

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import time
from random import shuffle

import pexpect  # third-party: pip install pexpect

# This function tests a single proxy by trying to open a telnet connection to it
def test_proxy(ip,port,max_timeout=1):
    child = pexpect.spawn("telnet " + ip + " " +str(port))
    time_send_request=time.time()
    try:
        i=child.expect(["Connected to","Connection refused"], timeout=max_timeout) #max timeout in seconds
    except pexpect.TIMEOUT:
        i=-1
    if i==0:
        time_request_ok=time.time()
        return {"status":True,"time_to_answer":time_request_ok-time_send_request}
    else:
        return {"status":False,"time_to_answer":max_timeout}


# This function tests every proxy in the list, updates its status and applies your custom rules
def update_proxy_list_status(proxy_list):
    for i in range(0,len(proxy_list)):
        print ("testing proxy "+str(i)+" "+proxy_list[i]["ip"]+":"+str(proxy_list[i]["port"]))
        proxy_status = test_proxy(proxy_list[i]["ip"],proxy_list[i]["port"])
        proxy_list[i]["status_ok"]= proxy_status["status"]


        print(proxy_status)

        #here it is time to treat your own rule to update respective proxy dict

        #~ If a proxy is slow (takes over x time), it should be marked as Slow, a timestamp taken, and a counter increased.
        #~ If a proxy times out, it should be marked as Fail, a timestamp taken, and a counter increased.
        #~ If a proxy has no slows for 15 minutes after its last slow, the counter & timestamp are zeroed and the proxy returns to a fresh state.
        #~ If a proxy has no fails for 30 minutes after its last fail, the counter & timestamp are zeroed and the proxy returns to a fresh state.
        #~ If a proxy is slow 5 times in 1 hour, it should be removed from the pool for 1 hour.
        #~ If a proxy times out 5 times in 1 hour, it should be blacklisted for 1 hour.
        #~ If a proxy gets blocked twice in 3 hours, it should be blacklisted for 12 hours and marked as bad.
        #~ If a proxy gets marked as bad twice in 48 hours, it should notify me (email, Pushbullet... anything).

        if proxy_status["status"]:
            # Modify the proxy dict with your own rules (adding timestamp, last check time, last down, last up, etc.)
            #...
            pass
        else:
            #modify proxy dict with your own rules (adding timestamp, last check time, last down, last up etc...)
            #...
            pass        

    return proxy_list


# This function selects a good proxy and does the job
def main():

    # First populate a proxy list | these example proxies come from http://free-proxy.cz/en/
    proxy_list=[
        {"ip":"167.99.2.12","port":8080}, #bad proxy
        {"ip":"167.99.2.17","port":8080},
        {"ip":"66.70.160.171","port":1080},
        {"ip":"192.99.220.151","port":8080},
        {"ip":"142.44.137.222","port":80}
        # [...]
    ]



    # This variable keeps track of the last used proxy (to avoid using the same one twice in a row)
    previous_proxy_ip=""

    the_job=True
    while the_job:

        #here we update each proxy status
        proxy_list = update_proxy_list_status(proxy_list)

        # We keep only the proxies considered OK
        good_proxy_list = [d for d in proxy_list if d["status_ok"]]

        #here you can shuffle the list
        shuffle(good_proxy_list)

        # Select a proxy (not the same as the previous one)
        current_proxy={}
        for i in range(0,len(good_proxy_list)):
            if good_proxy_list[i]["ip"]!=previous_proxy_ip:
                previous_proxy_ip=good_proxy_list[i]["ip"]
                current_proxy=good_proxy_list[i]
                break

        #use this selected proxy to do the job
        print ("the current proxy is: "+str(current_proxy))

        #UPDATE SCRAPY PROXY

        #DO THE SCRAPY JOB
        print("DO MY SCRAPY JOB with the current proxy settings")

        #wait some seconds
        time.sleep(5)

if __name__ == "__main__":
    main()
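For the `#UPDATE SCRAPY PROXY` step, the usual approach is a Scrapy downloader middleware that sets `request.meta["proxy"]` on each request. A minimal sketch is below; `get_proxy` is a hypothetical callable you would wire up to return the currently selected proxy dict from the pool above:

```python
# Minimal sketch of a rotating-proxy downloader middleware for Scrapy.
# `get_proxy` is a hypothetical hook: any callable returning a dict like
# {"ip": "...", "port": ...} chosen from the pool above.
class RotatingProxyMiddleware(object):
    def __init__(self, get_proxy):
        self.get_proxy = get_proxy

    def process_request(self, request, spider):
        # Scrapy's HTTP downloader honours request.meta["proxy"]
        proxy = self.get_proxy()
        request.meta["proxy"] = "http://%s:%s" % (proxy["ip"], proxy["port"])
```

You would enable it via the `DOWNLOADER_MIDDLEWARES` setting in `settings.py` (and in a real project give it a `from_crawler` classmethod so it can reach your pool), e.g. `DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RotatingProxyMiddleware": 350}` where `myproject.middlewares` is your own module path.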
A STEFANI answered Sep 30 '22