 

Getting blocked when scraping Amazon (even with headers, proxies, delay) [closed]

I have a Python script to scrape Amazon product listings. I have set proxies and headers, and I also call sleep() before each request. However, I still cannot get the data. The message I get back is:

To discuss automated access to Amazon data please contact [email protected]

Portions of my code are:

import random
import time
import requests

url = "https://www.amazon.com/Baby-Girls-Shoes/b/ref=sv_sl_fl_7239798011?ie=UTF8&node=7239798011"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
proxies_list = ["128.199.109.241:8080","113.53.230.195:3128","125.141.200.53:80","125.141.200.14:80","128.199.200.112:138","149.56.123.99:3128","128.199.200.112:80","125.141.200.39:80","134.213.29.202:4444"]
proxies = {'https': random.choice(proxies_list)}
time.sleep(0.5 * random.random())
r = requests.get(url, headers, proxies=proxies)
page_html = r.content
print(page_html)

This question is not a duplicate of others on Stack Overflow, because those suggest using proxies, headers, and a delay (sleep), and I have already done all of that. I am unable to scrape even after following their suggestions.

The code was working initially, but stopped working after scraping a few pages.

Tapa Dipti Sitaula asked Dec 28 '16

3 Answers

Instead of:

r = requests.get(url, headers, proxies=proxies)

Do:

r = requests.get(url, headers=headers, proxies=proxies)

This resolved the issue for me, at least for now. Hopefully the fix keeps working.
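To see why the keyword matters: the second positional parameter of requests.get is params, not headers, so the original call sent the headers dict as a URL query string and no User-Agent header at all. A minimal sketch using a stub that mirrors that signature shape (the stub is illustrative, not the real library):

```python
# Stub with the same parameter order as requests.get(url, params=None, **kwargs)
def get(url, params=None, **kwargs):
    # Return what the "request" would actually carry
    return {"url": url, "params": params, "headers": kwargs.get("headers")}

headers = {"user-agent": "Mozilla/5.0"}

wrong = get("https://example.com", headers)          # dict lands in `params`
right = get("https://example.com", headers=headers)  # dict lands in `headers`

print(wrong["headers"])   # None - no User-Agent was sent at all
print(right["headers"])   # the headers dict, as intended
```

With the positional call, Amazon sees a request with no browser User-Agent, which makes it trivial to block.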

Tapa Dipti Sitaula answered Sep 19 '22


From what you describe, Amazon is likely doing something extra (for example, with your cookies) to check whether you are using a browser. It's not insurmountable, though. To see the difference between a request from your browser and a request from your script, open the browser's developer tools and use "Copy as cURL" on a request to Amazon. Then transform the curl command into Python requests code with a curl-to-requests conversion tool. That gives you a request that looks exactly like the one from your browser. Do this a couple of times to understand if and how Amazon modifies your cookies on each request, and then try to mimic that behavior in your script.

If you are sure the requests look exactly the same, you probably need to increase the waiting time between consecutive requests. I hope this helps.
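As a sketch of the longer, randomized wait: the original 0.5 * random.random() pauses at most half a second, which is far too fast. Something in the 2-5 second range (an assumption; tune it for your target) is more plausible:

```python
import random

def polite_delay(base=2.0, jitter=3.0):
    """Return a randomized pause in seconds; uniform over [base, base + jitter).
    Randomizing the gap avoids a fixed, bot-like request cadence."""
    return base + jitter * random.random()

# Usage sketch (assumes the `requests` library and a list of URLs):
# for url in urls:
#     time.sleep(polite_delay())
#     r = requests.get(url, headers=headers, timeout=10)
```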

thelastone answered Sep 18 '22


Try using sessions in Requests; a Session remembers cookies and default headers across requests. If that fails, I would try Selenium 2 with either the Chrome driver or, if you prefer headless, the PhantomJS driver.
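A minimal sketch of the Session approach (the commented Amazon requests are placeholders and not executed here):

```python
import requests

# A Session persists cookies and default headers across requests, so
# anti-bot cookies set by one response are sent back on the next request.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) "
                  "Gecko/20100101 Firefox/50.0",
})

# Usage sketch (not executed here):
# r1 = session.get("https://www.amazon.com/")  # response cookies are stored
# r2 = session.get(product_url)                # cookies sent back automatically
```

The key difference from bare requests.get calls is that each call no longer starts from a blank cookie jar.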

fat fantasma answered Sep 19 '22