I have Python code to scrape Amazon product listings. I have set proxies and headers, and I also call sleep()
before each request. However, I still cannot get the data. The message I get back is:
To discuss automated access to Amazon data please contact [email protected]
Portions of my code are:
url = "https://www.amazon.com/Baby-Girls-Shoes/b/ref=sv_sl_fl_7239798011?ie=UTF8&node=7239798011"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
proxies_list = ["128.199.109.241:8080","113.53.230.195:3128","125.141.200.53:80","125.141.200.14:80","128.199.200.112:138","149.56.123.99:3128","128.199.200.112:80","125.141.200.39:80","134.213.29.202:4444"]
proxies = {'https': random.choice(proxies_list)}
time.sleep(0.5 * random.random())
r = requests.get(url, headers, proxies=proxies)
page_html = r.content
print page_html
This question is not a duplicate of others on Stack Overflow, because the others suggest using proxies, headers and a delay (sleep), and I have already done all of that. I am unable to scrape even after doing what they suggest.
The code was working initially, but stopped working after scraping a few pages.
Instead of:
r = requests.get(url, headers, proxies=proxies)
Do:
r = requests.get(url, headers=headers, proxies=proxies)
This resolved the issue for me for now. Hopefully, the resolution will keep working.
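A likely reason the keyword matters: in Requests, the second positional parameter of requests.get is params, so the original call put the headers dict into the query string and left the library's default User-Agent in place. The sketch below builds both versions of the request without sending them, so the difference can be inspected offline. It assumes only the url and headers from the question.

```python
import requests

url = "https://www.amazon.com/Baby-Girls-Shoes/b/ref=sv_sl_fl_7239798011?ie=UTF8&node=7239798011"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"}

session = requests.Session()

# Same effect as requests.get(url, headers): the dict is treated as `params`.
wrong = session.prepare_request(requests.Request("GET", url, params=headers))

# Same effect as requests.get(url, headers=headers).
right = session.prepare_request(requests.Request("GET", url, headers=headers))

print(wrong.url)                    # the user-agent string ends up in the query string
print(wrong.headers["User-Agent"])  # the default python-requests user agent
print(right.headers["User-Agent"])  # the Firefox string from `headers`
```

Printing the prepared request like this is a quick way to check what a script really sends before blaming the proxies or the delay.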
From what you describe, Amazon is likely doing something extra (for example with your cookies) to check whether you are using a browser. That does not mean you cannot get past it. To see the difference between a request from your browser and a request from your script, open the browser's developer tools and copy one request to Amazon as a curl command. Then convert the curl command to Python requests code with this tool. That gives you a request that looks exactly like the one your browser sends. Do this a couple of times to understand whether and how Amazon modifies your cookies on each request, and then try to mimic this behavior in your script.
If you are sure that the requests look exactly the same, you probably need to increase the waiting time between two consecutive requests. I hope this helps.
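For the waiting time, note that the question's sleep(0.5 * random.random()) never pauses longer than half a second. A minimal sketch of a longer, randomized delay that grows after each blocked attempt is shown below. The function name polite_delay and the base and cap values are illustrative assumptions, not figures known to satisfy Amazon.

```python
import random
import time


def polite_delay(attempt, base=2.0, cap=60.0):
    """Return a randomized wait in seconds that doubles with each attempt.

    `attempt` is 0 for the first try; `base` and `cap` are assumed values.
    """
    upper = min(cap, base * (2 ** attempt))
    # Jitter keeps the request timing from looking perfectly regular.
    return random.uniform(upper / 2, upper)


def wait_before_request(attempt):
    """Sleep for the computed delay and return it, e.g. for logging."""
    delay = polite_delay(attempt)
    time.sleep(delay)
    return delay
```

Calling wait_before_request(0) before a normal request and increasing attempt after each blocked response spaces retries further apart instead of hammering the site at a fixed rate.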
Try using sessions in Requests. A session remembers cookies and headers across requests. If that fails, I would try Selenium 2 with either the Chrome driver or, if you prefer headless, the PhantomJS driver.
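A rough sketch of the session approach follows. The helper names make_session and fetch are hypothetical, and the 10-second timeout is an arbitrary choice. Nothing here is guaranteed to get past Amazon's checks.

```python
import requests


def make_session(user_agent):
    """Create a Session that sends the same headers on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch(session, url, proxies=None):
    """Fetch a page through the session.

    Cookies set by earlier responses are stored on the session and
    sent back automatically on later requests.
    """
    response = session.get(url, proxies=proxies, timeout=10)
    response.raise_for_status()
    return response.text
```

Create the session once with make_session(headers['user-agent']) and pass it to every fetch call. Creating a new session per request would throw the stored cookies away.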