I've been looking around trying to find a decent pooling system for Scrapy but I can't find anything that has everything I need/want.
I'm looking for a solution that does the following:
If a proxy times out or is slow, it should be blacklisted through a series of rules (Scrapoxy only has blacklisting based on the number of instances / startups).
If a proxy is slow (takes over x time), it should be marked as Slow, a timestamp should be taken, and a counter should be increased.
If a proxy times out, it should be marked as Fail, a timestamp should be taken, and a counter should be increased.
Does anyone know of such a solution? The main feature is the blacklisting of slow/timed-out proxies.
As your pooling rules are very specific, you may need to code your own. The code below implements part of your rules (you will have to implement the others yourself):
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import pexpect,time
from random import shuffle
# This function tests a single proxy.
def test_proxy(ip, port, max_timeout=1):
    child = pexpect.spawn("telnet " + ip + " " + str(port))
    time_send_request = time.time()
    try:
        # max_timeout is in seconds
        i = child.expect(["Connected to", "Connection refused"], timeout=max_timeout)
    except (pexpect.TIMEOUT, pexpect.EOF):
        # no answer in time, or telnet exited without printing either message
        i = -1
    time_request_ok = time.time()
    # close the telnet child so repeated tests do not leak processes
    child.close()
    if i == 0:
        return {"status": True, "time_to_answer": time_request_ok - time_send_request}
    else:
        return {"status": False, "time_to_answer": max_timeout}
# This function tests every proxy in the list, updates its status and is the place to apply your custom rules.
def update_proxy_list_status(proxy_list):
    for i in range(0, len(proxy_list)):
        print("testing proxy " + str(i) + " " + proxy_list[i]["ip"] + ":" + str(proxy_list[i]["port"]))
        proxy_status = test_proxy(proxy_list[i]["ip"], proxy_list[i]["port"])
        proxy_list[i]["status_ok"] = proxy_status["status"]
        print(proxy_status)
        # Here it is time to apply your own rules and update the respective proxy dict:
        #~ If a proxy is slow (takes over x time) it should be marked as Slow and a timestamp should be taken and a counter should be increased.
        #~ If a proxy times out it should be marked as Fail and a timestamp should be taken and a counter should be increased.
        #~ If a proxy has no slows for 15 minutes after receiving its last slow then the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy has no fails for 30 minutes after receiving its last fail then the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy is slow 5 times in 1 hour then it should be removed from the pool for 1 hour.
        #~ If a proxy times out 5 times in 1 hour then it should be blacklisted for 1 hour.
        #~ If a proxy gets blocked twice in 3 hours it should be blacklisted for 12 hours and marked as bad.
        #~ If a proxy gets marked as bad twice in 48 hours then it should notify me (email, push bullet... anything).
        if proxy_status["status"]:
            # modify the proxy dict with your own rules (adding timestamp, last check time, last down, last up etc...)
            # ...
            pass
        else:
            # modify the proxy dict with your own rules (adding timestamp, last check time, last down, last up etc...)
            # ...
            pass
    return proxy_list
# This function selects a good proxy and does the job.
def main():
    # First populate a proxy list | I got these example proxies from http://free-proxy.cz/en/
    proxy_list = [
        {"ip": "167.99.2.12", "port": 8080},  # bad proxy
        {"ip": "167.99.2.17", "port": 8080},
        {"ip": "66.70.160.171", "port": 1080},
        {"ip": "192.99.220.151", "port": 8080},
        {"ip": "142.44.137.222", "port": 80}
        # [...]
    ]
    # This variable keeps track of the last used proxy (to avoid using the same one twice in a row).
    previous_proxy_ip = ""
    the_job = True
    while the_job:
        # here we update each proxy status
        proxy_list = update_proxy_list_status(proxy_list)
        # we keep only the proxies considered ok
        good_proxy_list = [d for d in proxy_list if d["status_ok"]]
        # here you can shuffle the list
        shuffle(good_proxy_list)
        # select a proxy (not the same as the previous one)
        current_proxy = {}
        for i in range(0, len(good_proxy_list)):
            if good_proxy_list[i]["ip"] != previous_proxy_ip:
                previous_proxy_ip = good_proxy_list[i]["ip"]
                current_proxy = good_proxy_list[i]
                break
        # use the selected proxy to do the job
        print("the current proxy is: " + str(current_proxy))
        # UPDATE SCRAPY PROXY
        # DO THE SCRAPY JOB
        print("DO MY SCRAPY JOB with the current proxy settings")
        # wait some seconds
        time.sleep(5)

if __name__ == "__main__":
    main()
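For the rules left as comments above, here is a minimal sketch of how the Slow/Fail counters, the 15/30-minute resets and the 1-hour blacklisting could be tracked per proxy. It is only one possible approach, not a tested part of the script: the names ProxyHealth, classify, RULES and the slow_after threshold (the "x time" from the question) are made up for illustration, and clearing the counter after a ban is my own assumption. The "blocked twice in 3 hours" and notification rules are not covered.

```python
import time

SLOW, FAIL = "slow", "fail"

# Thresholds in seconds, taken from the rules quoted in the comments above.
# The dict layout itself is hypothetical.
RULES = {
    SLOW: {"reset_after": 15 * 60, "window": 60 * 60, "limit": 5, "ban": 60 * 60},
    FAIL: {"reset_after": 30 * 60, "window": 60 * 60, "limit": 5, "ban": 60 * 60},
}


def classify(proxy_status, slow_after):
    """Map a test_proxy()-style result dict to FAIL, SLOW or None (healthy)."""
    if not proxy_status["status"]:
        return FAIL
    if proxy_status["time_to_answer"] > slow_after:
        return SLOW
    return None


class ProxyHealth(object):
    """Keeps the slow/fail timestamps of one proxy and decides when to ban it."""

    def __init__(self):
        self.events = {SLOW: [], FAIL: []}  # timestamps per event kind
        self.banned_until = 0.0

    def record(self, kind, now=None):
        now = time.time() if now is None else now
        rule = RULES[kind]
        stamps = self.events[kind]
        # No event of this kind for `reset_after` seconds: back to a fresh state.
        if stamps and now - stamps[-1] >= rule["reset_after"]:
            stamps = []
        stamps.append(now)
        # Only events inside the sliding window count towards the limit.
        stamps = [t for t in stamps if now - t < rule["window"]]
        if len(stamps) >= rule["limit"]:
            self.banned_until = max(self.banned_until, now + rule["ban"])
            stamps = []  # assumption: the counter restarts after a ban
        self.events[kind] = stamps

    def is_available(self, now=None):
        now = time.time() if now is None else now
        return now >= self.banned_until
```

In update_proxy_list_status each proxy dict could carry its own ProxyHealth object, call record() whenever classify() returns SLOW or FAIL, and main() could then filter the pool on is_available() in addition to status_ok. For the "UPDATE SCRAPY PROXY" step, Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"], so setting that key to something like "http://ip:port" in a downloader middleware is one way to apply the selected proxy.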