 

How to prevent getting blacklisted while scraping Amazon [closed]

I'm trying to scrape Amazon with Scrapy, but I get this error:

DEBUG: Retrying <GET http://www.amazon.fr/Amuses-bouche-Peuvent-b%C3%A9n%C3%A9ficier-dAmazon-Premium-Epicerie/s?ie=UTF8&page=1&rh=n%3A6356734031%2Cp_76%3A437878031> 
(failed 1 times): 503 Service Unavailable

I think it's because Amazon is very good at detecting bots. How can I prevent this?

I used time.sleep(6) before every request.

I don't want to use their API.

I also tried using Tor and Polipo.

asked May 06 '16 by parik

People also ask

How do I stop being blocked from web scraping?

If you send repetitive requests from the same IP, the website owners can detect your footprint and may block your web scrapers by checking the server log files. To avoid this, you can use rotating proxies. A rotating proxy is a proxy server that allocates a new IP address from a set of proxies stored in the proxy pool.
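Below is a minimal sketch of the rotating-proxy idea described above, using the Python requests library. The proxy addresses are placeholders (TEST-NET range) and would need to be replaced with proxies from an actual provider.

    # Rotating-proxy sketch: each request goes out through the next proxy in the pool.
    import itertools
    import requests

    PROXY_POOL = [
        "http://203.0.113.10:8080",  # placeholder proxy addresses
        "http://203.0.113.11:8080",
        "http://203.0.113.12:8080",
    ]
    proxy_cycle = itertools.cycle(PROXY_POOL)

    def fetch(url):
        """Send the request through the next proxy in the pool."""
        proxy = next(proxy_cycle)
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

    response = fetch("http://www.amazon.fr/")
    print(response.status_code)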

Does Amazon ban web scraping?

Amazon can detect bots and block their IPs. Since Amazon prohibits web scraping on its pages, it can easily detect whether an action is being executed by a scraper bot or by a human using a browser.

Can you get blocked for web scraping?

Web pages detect web crawlers and web scraping tools by checking their IP addresses, user agents, browser parameters, and general behavior. If the website finds it suspicious, you receive CAPTCHAs and then eventually your requests get blocked since your crawler is detected.
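Since the user agent is one of the signals mentioned above, a common mitigation is to rotate it. Here is a minimal sketch with the Python requests library; the User-Agent strings are only illustrative examples.

    # Rotate User-Agent headers so consecutive requests don't share one browser signature.
    import random
    import requests

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
    ]

    def fetch(url):
        # Pick a different browser signature for every request.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        return requests.get(url, headers=headers, timeout=10)

    print(fetch("http://www.amazon.fr/").status_code)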

How do I stop IP blocking?

The most common methods to hide or change your IP are: using a virtual private network (VPN), proxy servers, and the Tor browser. The first is by far the most effective way to circumvent IP blocking.
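Since the question mentions Tor and Polipo, here is a minimal sketch of routing requests through a local Tor SOCKS proxy instead, assuming Tor is running on its default port 9050 and the requests SOCKS extra is installed (pip install requests[socks]).

    # Route HTTP requests through a locally running Tor daemon.
    import requests

    TOR_PROXY = "socks5h://127.0.0.1:9050"  # socks5h also resolves DNS through Tor

    def fetch(url):
        return requests.get(url, proxies={"http": TOR_PROXY, "https": TOR_PROXY}, timeout=30)

    # check.torproject.org reports whether the request actually came through Tor.
    print(fetch("https://check.torproject.org/").text[:200])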


1 Answer

You have to be very careful with Amazon and follow the Amazon Terms of Use and policies related to web-scraping.

Amazon is quite good at banning the IPs of bots. You would have to tweak DOWNLOAD_DELAY and CONCURRENT_REQUESTS to hit the website less often and be a good web-scraping citizen, and you would need to rotate IP addresses (look into, for instance, Crawlera) and user agents. An example configuration is sketched below.
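A minimal sketch of the Scrapy settings this answer refers to, placed in the project's settings.py; the exact values are illustrative, not prescriptive.

    # settings.py -- throttle the crawl so it hits the site less often
    DOWNLOAD_DELAY = 6                   # wait ~6 seconds between requests to the same site
    RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay (0.5x to 1.5x) to look less robotic
    CONCURRENT_REQUESTS = 1              # keep overall concurrency low
    CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain
    AUTOTHROTTLE_ENABLED = True          # let Scrapy slow down automatically under load
    # Set a realistic browser User-Agent instead of Scrapy's default
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"

Rotating IPs (for example via a proxy service like Crawlera) and rotating user agents would be layered on top of these settings through downloader middlewares.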

answered Sep 22 '22 by alecxe