Description:
I am scraping Amazon product prices, and I need to change the selected address (a.k.a. the delivery address) so that I can scrape the price of the same product for different locations (the price can differ by location).
I have been working on this project for months. Other sites that I have scraped return price info via a JavaScript request to their server, given arguments such as a skuid, a productid, and of course a location. Amazon does not do this; it appears to request the whole HTML page, with the price as part of it. Because of this, there is no simple way to scrape prices for different locations by just specifying a different location parameter in a JavaScript request to the server. As far as I can tell, Amazon returns price info based on the selected address, which is identified by cookies. Hence my scraping strategy.
I loaded Amazon in my web browser, chose different locations, and manually copied the cookies from the Network tab of Chrome's developer tools, so that I could use these cookies in my script to retrieve different prices. (The assumption here is that the location info is stored in cookies.)
It kind of worked: I could scrape different prices using different cookies.
Symptom:
The problem arises after just a few requests. Everything works fine during the first cycle of requesting the Amazon page with each of the saved cookies. After that, requests using two of my previously saved cookies for two different locations return the same price (which should be different). Looking into the page source, I find that the selected address is also the same (no surprise, since the price is the same).
Desired result:
The ultimate goal is to scrape location-based prices off an Amazon page. The current objective is for the manually saved cookies to return different page source depending on the desired location.
What happens instead:
Requesting the Amazon page with manually saved cookies seems to stop working after one cycle of requests: different cookies return the same price/location info.
Note:
We can focus on just the selected address, since the price is a function of it. So my code below prints the selected address and leaves out the price.
Code:
import requests
from lxml import html

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
#location:ShangHai, price:17,999
#,'Cookie':'x-wl-uid=1DpDBnhSVJ+bNXzDZYpD4q7+iDfJ6GQATOxQy6bH2BnaDE4n/i4aKzzAQ0HKWvjhi4SmEwEIuSuA=; session-token=eYnxCsjigX0nCy8skngiSDkvjfEZlKavU9mTR8e9EP1Lh0pg4oYpoBxP3adQe7vZE9IDvl7xeLN+H5WF25TVXNTTywA3/Y82cuN+a2CdJs1L57Mzvwq7aLrbwtYQJfG2e1WP5/EXrV7oE02b8TB7KJA36q4w351NbUttqmq/yVrJQj7CZ+HMYcIsoxH2Ux8awhZ9jsROFJcaLmtcy+6muoLrtYQpa/QX230yKBQA90lu+D9jtd46BQ==; csm-hit=s-ZAYA2VY7F5MBZA6VKTTQ|1482202685913; ubid-acbcn=453-6620157-1313521; session-id-time=2082729601l; session-id=455-5193383-6307663'
#location: BeiJing price:15,999
,'Cookie':'x-acbcn=EwNBz6OLTIFDxQCv1qiUE4m16A00AUKs; at-main=Atza|IwEBINcqsHbV-1tFBCYlshzjTAyv5Z4msKVZ0rbOATXYrjE7AcoO3LSnYDzYpZcY2C4WP3oOPIlqWLWh9UcAzDHu6Xv6xcdbCW7jQ59cifSfpYiv3UQ0qR5Hk2VJjX0dcrsdgJUw-TWW8ZWLLhs2Z_CTD7Mphdn9fgvg7qnREuayGRpxekotq9lRXxeqJn3-IfoanhF9edDc0MYk2jTDtJv0AiJp71Wwo6PsNRTwwCg0JS69-H5QYeRbXfFSP-dTtVSGzB-MgVo4zX6dRSmYQ12_rjbfZa7ihj0s-3KtBFLnVP-R91VJrvDwMBSjfcyJHL734UfSrN6D6c1MCq76NoM-MpzmKncsn3n7Ruhnxork43k0onNA0jTl4SD1UDQ8dweuxP6FN0O7eTrWTaBkP_isuiDI; sess-at-main="Wyf/mENo8M2ZhLuc1RWCf++uvPG19jd3RE0X61PIhrk="; x-wl-uid=12Hr4lOV8Md2tj2TjdgVpNVGb5aL6MrEz19aI0yHjr7FY8N3HsTCe29HlZhe4NCBbeDw2KuN5ShkJajzdy70eGSYuSAIda2OF1CcLpnHo+Bd7mvKvVqTsj1pNwri9d8E2lMOUplbiuZ8=; session-id-time-cn=1482739200l; session-id-cn=452-5760864-5873122; session-token=h1J7fMqt9UYrlp3EVScY8zWkFsNT7oGwBzJLHkKb8ChGVAMO/6quZxt9R24wwGPUCc4BPFLofrOQ5ZG9Jf9KQ5Y7j6XhKqlUh9j3g60qdVTgNSM6gY+eERRbI7iWTLGXQwEBB9LOx49+htkQIMfw1coTjYn50RlfUeeuW9dE8Db937LkwRFJe1ewcyebJZ713u/9HGAFQvCwatOslgNVHrpWOPGW91OUqhkYdW9wS6G46ScDqefXu2tRqWL8mOKn7t4wdMGqaF8=; csm-hit=YWXGRBPABQJZHSN0TYMN+s-88A1TVR7SK2BG3FDDW2M|1482139543670; ubid-acbcn=453-1347853-1253656; session-id-time=2082729601l; session-id=452-5760864-5873122'
#location: ShenZhen price:17,999
#,'Cookie':'x-wl-uid=1SGKhC3F2g+mtZV/OFWnOwBOuLf8I+HnSJZOyHVbtVHXhyEkpj6cGqURI4kbZl/A7I3J2/0ByMc4=; session-token=TnIBK9s4/NJRHfHVd9gnxg4EA9GZ6wGk9AdAwc1tC5YNWYxS4S9p1IloF+Ex5lQ7O/4DlGB2WpPT7OdrCn/wyhNqLkTUB9ChdqvX3dw0UZW/Rhxsy8gTbdq5BrWCoHIL8y24sAM47Y7YZeAy6MAu9tXxH9wEtb4CF2BqTsp/B3hjGxkNuKwA8tQ1pEAZhnkzFx6tIdAfIvNWCN3c7NmbCoLRELpprDAbYrlLL/ik6lKvBvawLzAqng==; csm-hit=6CE5VNWDNAEMSWZT2VJJ+s-1BMM2C5SJPJV4B4RGMH7|1482205167978; ubid-acbcn=453-9465199-8612643; session-id-time=2082729601l; session-id=456-0427731-7194850'
#location: SuZhou, price:13,999
#,'Cookie':'x-wl-uid=1SGKhC3F2g+mtZV/OFWnOwBOuLf8I+HnSJZOyHVbtVHXhyEkpj6cGqURI4kbZl/A7I3J2/0ByMc4=; session-token=TnIBK9s4/NJRHfHVd9gnxg4EA9GZ6wGk9AdAwc1tC5YNWYxS4S9p1IloF+Ex5lQ7O/4DlGB2WpPT7OdrCn/wyhNqLkTUB9ChdqvX3dw0UZW/Rhxsy8gTbdq5BrWCoHIL8y24sAM47Y7YZeAy6MAu9tXxH9wEtb4CF2BqTsp/B3hjGxkNuKwA8tQ1pEAZhnkzFx6tIdAfIvNWCN3c7NmbCoLRELpprDAbYrlLL/ik6lKvBvawLzAqng==; csm-hit=6CE5VNWDNAEMSWZT2VJJ+s-30EVQ4E6P9Y51PV6WGMW|1482205288109; ubid-acbcn=453-9465199-8612643; session-id-time=2082729601l; session-id=456-0427731-7194850'
,'Host':'www.amazon.cn'
,'X-Requested-With':'XMLHttpRequest'}
url = 'https://www.amazon.cn/TCL-%E7%8E%8B%E7%89%8C-L65C2-CUDG-65%E8%8B%B1%E5%AF%B8-%E6%96%B0%E7%9A%84HDR%E6%8A%80%E6%9C%AF-%E5%85%A8%E6%96%B0%E7%9A%84%E9%87%8F%E5%AD%90%E7%82%B9%E6%8A%80%E6%9C%AF-%E9%BB%91%E8%89%B2/dp/B01FXB0ZG4/ref=sr_1_2?ie=UTF8&qid=1476165637&sr=8-2&keywords=L65C2-CUDG'
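def getAddress(url):
    # Fetch the product page using whichever cookie is active in headers.
    response = requests.get(url, headers=headers)
    tree = html.fromstring(response.content)
    # The currently selected delivery address is rendered in this span.
    xpath = '//span[@id="ddmSelectedAddressText"]'
    print(tree.xpath(xpath)[0].text)

getAddress(url)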
Also, to be clear, there are 4 locations here, as you can verify from the 4 cookies above. Basically, I comment out the other three, leave one in the headers, and call getAddress(url).
I figure this could be a common problem for those who are trying to do the same thing (scraping prices based on different locations). Any thoughts will be appreciated! Also, since I mentioned my ultimate goal, a different method of achieving the same result is relevant to this question too.
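Instead of commenting cookies in and out by hand, the saved cookies could be kept in a dict keyed by location and merged into the shared headers per request. The following is only a sketch under assumptions: the names BASE_HEADERS, COOKIES_BY_LOCATION, headers_for, addresses_by_location and echo_cookie are made up for illustration, the cookie strings are abbreviated to two fields each (the real ones are the full strings above), the short product URL is an assumed form that is never requested here, and fetch_address stands in for a variant of getAddress that accepts headers and returns the text. The stub makes no network request.

```python
# Headers shared by every request (values taken from the script above).
BASE_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Host': 'www.amazon.cn',
    'X-Requested-With': 'XMLHttpRequest',
}

# Abbreviated cookie strings for illustration; use the full saved values.
COOKIES_BY_LOCATION = {
    'ShangHai': 'ubid-acbcn=453-6620157-1313521; session-id=455-5193383-6307663',
    'BeiJing': 'ubid-acbcn=453-1347853-1253656; session-id=452-5760864-5873122',
}


def headers_for(location):
    """Return a fresh headers dict carrying the cookie saved for location."""
    headers = dict(BASE_HEADERS)  # copy, so BASE_HEADERS is never mutated
    headers['Cookie'] = COOKIES_BY_LOCATION[location]
    return headers


def addresses_by_location(url, fetch_address):
    """Call fetch_address(url, headers) once per saved location.

    fetch_address is a caller-supplied (hypothetical) function, e.g. a
    version of getAddress that takes headers and returns the span text.
    """
    return {
        location: fetch_address(url, headers_for(location))
        for location in COOKIES_BY_LOCATION
    }


def echo_cookie(url, headers):
    """Offline stub: report which cookie would have been sent."""
    return headers['Cookie']


# Assumed short URL form; echo_cookie never actually requests it.
results = addresses_by_location('https://www.amazon.cn/dp/B01FXB0ZG4', echo_cookie)
for location, sent_cookie in results.items():
    print(location, '->', sent_cookie)
```

Passing the headers in explicitly, rather than relying on the global headers dict, makes it easier to see which cookie went with which request.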
I finally figured out what the problem is and resolved it. Yay! So here goes.
1. The assumption that I made is correct: cookies help Amazon determine your selected address. In fact, cookies alone will do the job.
2. The reason cookies seemed to randomly fail to retain the selected address is that each cookie contains a session-id value, as you can see in the last cookie in my code above (I repeat it below for convenience). Somehow the two cookies that failed had the same session-id. Once I cleared my cookies in Chrome and obtained a new cookie with a different session-id for the malfunctioning one, everything worked.
3. I only have a vague idea of how session-id helps determine the selected address, however, so if anyone could explain exactly how Amazon uses this session-id to return one selected address rather than another, that'd be great.
#location: ShenZhen, price: 17,999
'Cookie':'session-id=456-0427731-7194850, other parts = other parts'
#location: SuZhou, price: 13,999
'Cookie':'session-id=456-0427731-7194850, other parts = other parts'
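A quick way to catch this before scraping is to check the saved cookies for repeated session-id values. The sketch below is only illustrative: the helper names parse_cookie_header and find_shared_session_ids are made up, the cookie strings are abbreviated to two fields each, and it assumes the raw Cookie header uses the usual "name=value; name=value" layout.

```python
from collections import defaultdict


def parse_cookie_header(cookie_header):
    """Split a raw Cookie header value into a name -> value dict."""
    cookies = {}
    for part in cookie_header.split(';'):
        # Split on the first '=' only, since values may contain '=' too.
        name, sep, value = part.strip().partition('=')
        if sep:
            cookies[name] = value
    return cookies


def find_shared_session_ids(saved_cookies):
    """Return {session_id: [locations]} for ids used by several locations."""
    by_session = defaultdict(list)
    for location, cookie_header in saved_cookies.items():
        session_id = parse_cookie_header(cookie_header).get('session-id')
        if session_id is not None:
            by_session[session_id].append(location)
    return {sid: locs for sid, locs in by_session.items() if len(locs) > 1}


# Abbreviated versions of the cookies saved in the question.
saved_cookies = {
    'ShenZhen': 'ubid-acbcn=453-9465199-8612643; session-id=456-0427731-7194850',
    'SuZhou': 'ubid-acbcn=453-9465199-8612643; session-id=456-0427731-7194850',
    'BeiJing': 'ubid-acbcn=453-1347853-1253656; session-id=452-5760864-5873122',
}

print(find_shared_session_ids(saved_cookies))
```

Any location listed in the result shares its session-id with another one, so one of those cookies would need to be captured again after clearing the browser cookies.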