I am trying to scrape Amazon with Scrapy, but I get this error:
DEBUG: Retrying <GET http://www.amazon.fr/Amuses-bouche-Peuvent-b%C3%A9n%C3%A9ficier-dAmazon-Premium-Epicerie/s?ie=UTF8&page=1&rh=n%3A6356734031%2Cp_76%3A437878031>
(failed 1 times): 503 Service Unavailable
I think it's because Amazon is very good at detecting bots. How can I prevent this?
I used time.sleep(6) before every request.
I don't want to use their API.
I also tried using Tor and Polipo.
If you send repeated requests from the same IP address, the website owners can spot your footprint in the server log files and may block your scraper. To avoid this, you can use rotating proxies. A rotating proxy is a proxy server that assigns a new IP address, taken from a pool of proxies, to each request or connection.
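As a rough illustration of the rotating-proxy idea in Scrapy, here is a minimal sketch of a downloader middleware that picks a proxy from a pool for each request. The class name `RotatingProxyMiddleware` and the proxy addresses are hypothetical; the only Scrapy behaviour it relies on is that the built-in HttpProxyMiddleware uses the value of request.meta["proxy"].

```python
import random

# Hypothetical proxy pool (assumption): replace with proxies you are allowed to use.
PROXY_POOL = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]


class RotatingProxyMiddleware:
    """Downloader middleware that assigns a random proxy from the pool to each request."""

    def __init__(self, proxies=None):
        self.proxies = list(proxies or PROXY_POOL)

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware routes the request through
        # whatever URL is stored under request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request normally
```

To try it, the class would be registered in DOWNLOADER_MIDDLEWARES with an order lower than 750 (the default position of HttpProxyMiddleware), for example under a module path such as "myproject.middlewares.RotatingProxyMiddleware", which is an assumed name.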
Amazon can detect bots and block their IPs. Because Amazon tries to prevent web scraping of its pages, it can easily tell whether an action is performed by a scraper bot or by a person using a browser.
Websites detect web crawlers and scraping tools by checking their IP addresses, user agents, browser parameters, and general behavior. If the website finds the traffic suspicious, you first receive CAPTCHAs and eventually your requests are blocked because your crawler has been detected.
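Since the user agent is one of the signals mentioned above, one option is to vary it between requests. The following is only a sketch: the class name `RotatingUserAgentMiddleware` is hypothetical and the user-agent strings are illustrative examples, not a recommended list.

```python
import random

# Illustrative user-agent strings (assumption): use current, real browser values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/115.0",
]


class RotatingUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent header on each request."""

    def __init__(self, user_agents=None):
        self.user_agents = list(user_agents or USER_AGENTS)

    def process_request(self, request, spider):
        # Overwrite the header before the request is sent.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # continue with the normal download
```

If it runs before Scrapy's own UserAgentMiddleware (default order 500), the header set here should be kept, because the built-in middleware only fills in a default when none is present.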
The most common ways to hide or change your IP are a virtual private network (VPN), proxy servers, and the Tor browser. The first of these, a VPN, is by far the most effective way to circumvent IP blocking.
You have to be very careful with Amazon and follow the Amazon Terms of Use and policies related to web-scraping.
Amazon is quite good at banning the IPs of bots. You would have to tweak DOWNLOAD_DELAY and CONCURRENT_REQUESTS to hit the website less often and be a good web-scraping citizen. You would also need to rotate IP addresses (you may look into, for instance, Crawlera) and user agents.
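For illustration, the throttling settings could look like the sketch below. The setting names are standard Scrapy settings, but the values and the name `POLITE_SETTINGS` are assumptions chosen for the example and should be tuned for your case. Note that calling time.sleep() inside a spider blocks Scrapy's event loop, so DOWNLOAD_DELAY is the usual way to space out requests.

```python
# Throttling settings for a Scrapy spider (values are illustrative assumptions).
# They can be placed in settings.py or in a spider's `custom_settings` attribute.
POLITE_SETTINGS = {
    # Wait about 6 seconds between requests to the same site.
    "DOWNLOAD_DELAY": 6,
    # Vary the actual delay between 0.5x and 1.5x of DOWNLOAD_DELAY.
    "RANDOMIZE_DOWNLOAD_DELAY": True,
    # Send only one request at a time, overall and per domain.
    "CONCURRENT_REQUESTS": 1,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Let Scrapy adapt the delay to the server's response times.
    "AUTOTHROTTLE_ENABLED": True,
    # Retry failed responses (such as 503) only a couple of times.
    "RETRY_TIMES": 2,
}
```

In a spider class this would be used as custom_settings = POLITE_SETTINGS.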