I'm trying to do some scraping, but I get blocked every 4 requests. I have tried changing proxies, but the error stays the same. What should I do to switch proxies properly?
Here is the code where I try it. First I get proxies from a free website, then I make the request with the new proxy, but it doesn't work because I still get blocked.
from fake_useragent import UserAgent
import requests
from bs4 import BeautifulSoup

def get_player(id, proxy):
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    url = 'https://www.transfermarkt.es/jadon-sancho/profil/spieler/' + str(id)
    try:
        print(proxy)
        r = requests.get(url, headers=headers, proxies=proxy)
    except requests.exceptions.RequestException:
        ....
    ....
    code to manage the data
    ....
def get_proxies():
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    url = 'https://free-proxy-list.net/'
    r = requests.get(url, headers=headers)
    page = BeautifulSoup(r.text, 'html.parser')
    proxies = []
    for proxy in page.find_all('tr'):
        i = ip = port = 0
        for data in proxy.find_all('td'):
            if i == 0:
                ip = data.get_text()
            if i == 1:
                port = data.get_text()
            i += 1
        if ip != 0 and port != 0:
            proxies += [{'http': 'http://' + ip + ':' + port}]
    return proxies
proxies = get_proxies()
for i in range(1, 100):
    player = get_player(i, proxies[i // 4])
    ....
    code to manage the data
    ....
I know the proxy scraping works, because when I print them I see something like: {'http': 'http://88.12.48.61:42365'}. I would just like to stop getting blocked.
A rotating proxy changes your IP address on every request you send to the target. The easiest way to get one working quickly is to buy a residential rotating-proxy service.
Keep in mind that when you need to keep the same IP address for a continuous session, proxies with sticky sessions are preferred. In contrast, a rotating proxy is the right choice when you should not reuse the same IP address for long, as in your case.
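As a rough sketch of per-request rotation, reusing the get_proxies() helper from your question (entries look like {'http': 'http://88.12.48.61:42365'}), you can cycle through the list so that every request goes out through a different proxy:

import itertools
import requests

proxies = get_proxies()                  # helper from the question
proxy_pool = itertools.cycle(proxies)    # endless round-robin over the list

for player_id in range(1, 100):
    proxy = next(proxy_pool)             # fresh proxy for every single request
    try:
        r = requests.get(
            'https://www.transfermarkt.es/jadon-sancho/profil/spieler/' + str(player_id),
            proxies=proxy,
            timeout=10,
        )
    except requests.exceptions.RequestException:
        continue                         # dead or blocked proxy: move on to the next

Free proxies die constantly, so catching the request exception and simply moving on to the next proxy is usually necessary in practice.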
To use proxies with the requests library, you pass a dictionary that maps each connection scheme (such as http or https) to a proxy URL. requests picks the entry whose key matches the scheme of the URL being requested, so for an https:// target like transfermarkt.es you need an 'https' entry; a dictionary with only an 'http' key (as in your question) sends HTTPS requests directly from your own IP, which is likely why the blocks persist. The same proxies argument works for any kind of request, including GET and POST.
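For example, a minimal version of that dictionary (using the illustrative address printed in your question) looks like this:

import requests

# requests selects the entry whose key matches the scheme of the request URL.
proxies = {
    'http': 'http://88.12.48.61:42365',
    'https': 'http://88.12.48.61:42365',  # needed for https:// targets like transfermarkt.es
}

r = requests.get('https://www.transfermarkt.es', proxies=proxies)
print(r.status_code)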
I recently had this same issue, but using the online proxy servers recommended in other answers is always risky (from a privacy standpoint), slow, or unreliable.
Instead, you can use the requests-ip-rotator Python library to proxy traffic through AWS API Gateway, which gives you a new IP for each request:

pip install requests-ip-rotator
This can be used as follows (for your site specifically):
import requests
from requests_ip_rotator import ApiGateway

# Create and start an API Gateway endpoint in front of the target site.
gateway = ApiGateway("https://www.transfermarkt.es")
gateway.start()

# Route every request this session makes to the site through the gateway.
session = requests.Session()
session.mount("https://www.transfermarkt.es", gateway)

response = session.get("https://www.transfermarkt.es/jadon-sancho/profil/spieler/your_id")
print(response.status_code)

# Only run this line if you are no longer going to run the script, as it takes
# longer to boot up again next time.
gateway.shutdown()
Combined with multithreading/multiprocessing, you'll be able to scrape the site in no time.
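As a rough sketch of that combination (assuming AWS credentials are already configured for the library, and noting that sharing one requests.Session across threads is common practice though not formally guaranteed thread-safe):

import requests
from concurrent.futures import ThreadPoolExecutor
from requests_ip_rotator import ApiGateway

gateway = ApiGateway("https://www.transfermarkt.es")
gateway.start()

session = requests.Session()
session.mount("https://www.transfermarkt.es", gateway)

def fetch(player_id):
    # Each request exits AWS from a different IP, so no per-proxy bookkeeping is needed.
    url = "https://www.transfermarkt.es/jadon-sancho/profil/spieler/" + str(player_id)
    return player_id, session.get(url).status_code

with ThreadPoolExecutor(max_workers=8) as pool:
    for player_id, status in pool.map(fetch, range(1, 100)):
        print(player_id, status)

gateway.shutdown()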
The AWS free tier provides you with 1 million requests per region, so this option will be free for all reasonable scraping.