
Trouble scraping all the books from a section without hardcoding payload

I've created a script to scrape the names of the books listed under the Customers who bought this item also bought section on pages like these. Once you click the right arrow button, you can see all the related books. I've used two different book links within the script to see how it behaves.

The payload I use within the POST request is hardcoded and only works for the first link in product_links. The payload appears to be available in the page source, but I can't find the right way to extract it automatically. There are several ids within the payload which won't be identical when I use another book link, so hardcoding the payload doesn't seem like a good idea.

I've tried with:

import requests
from bs4 import BeautifulSoup

product_links = [
    'https://www.amazon.com/Essential-Keto-Diet-Beginners-2019/dp/1099697018/',
    'https://www.amazon.com/Keto-Cookbook-Beginners-Low-Carb-Homemade/dp/B08QFBMSFT/'
]

url = 'https://www.amazon.com/acp/p13n-desktop-carousel/funjjvdbohwkuezi/getCarouselItems'
payload = {"aCarouselOptions":"{\"ajax\":{\"id_list\":[\"{\\\"id\\\":\\\"B07NYZJX2L\\\"}\",\"{\\\"id\\\":\\\"1939754445\\\"}\",\"{\\\"id\\\":\\\"1792145454\\\"}\",\"{\\\"id\\\":\\\"1073560988\\\"}\",\"{\\\"id\\\":\\\"1119578922\\\"}\",\"{\\\"id\\\":\\\"B083K5RRSG\\\"}\",\"{\\\"id\\\":\\\"B07SPSXHZ8\\\"}\",\"{\\\"id\\\":\\\"B08GG2RL1D\\\"}\",\"{\\\"id\\\":\\\"1507212305\\\"}\",\"{\\\"id\\\":\\\"B08QFBMSFT\\\"}\",\"{\\\"id\\\":\\\"164152247X\\\"}\",\"{\\\"id\\\":\\\"1673455980\\\"}\",\"{\\\"id\\\":\\\"B084DD8WHP\\\"}\",\"{\\\"id\\\":\\\"1706342667\\\"}\",\"{\\\"id\\\":\\\"1628603135\\\"}\",\"{\\\"id\\\":\\\"B08NZV2Z4N\\\"}\",\"{\\\"id\\\":\\\"1942411294\\\"}\",\"{\\\"id\\\":\\\"1507209924\\\"}\",\"{\\\"id\\\":\\\"1641520434\\\"}\",\"{\\\"id\\\":\\\"B084Z7627Q\\\"}\",\"{\\\"id\\\":\\\"B08NRXFZ98\\\"}\",\"{\\\"id\\\":\\\"1623159326\\\"}\",\"{\\\"id\\\":\\\"B0827DHLR6\\\"}\",\"{\\\"id\\\":\\\"B08TL5W56Z\\\"}\",\"{\\\"id\\\":\\\"1941169171\\\"}\",\"{\\\"id\\\":\\\"1645670945\\\"}\",\"{\\\"id\\\":\\\"B08GLSSNKF\\\"}\",\"{\\\"id\\\":\\\"B08RR4RJHB\\\"}\",\"{\\\"id\\\":\\\"B07WRQ4CF4\\\"}\",\"{\\\"id\\\":\\\"B08Y49Z3V1\\\"}\",\"{\\\"id\\\":\\\"B08LNX32ZL\\\"}\",\"{\\\"id\\\":\\\"1250621097\\\"}\",\"{\\\"id\\\":\\\"1628600071\\\"}\",\"{\\\"id\\\":\\\"1646115511\\\"}\",\"{\\\"id\\\":\\\"1705799507\\\"}\",\"{\\\"id\\\":\\\"B08XZCM2P4\\\"}\",\"{\\\"id\\\":\\\"1072855267\\\"}\",\"{\\\"id\\\":\\\"B08VCMWPB9\\\"}\",\"{\\\"id\\\":\\\"1623159229\\\"}\",\"{\\\"id\\\":\\\"B08KH2J3FM\\\"}\",\"{\\\"id\\\":\\\"B08D54RBGP\\\"}\",\"{\\\"id\\\":\\\"1507212992\\\"}\",\"{\\\"id\\\":\\\"1635653894\\\"}\",\"{\\\"id\\\":\\\"B01MUB7BUV\\\"}\",\"{\\\"id\\\":\\\"0358120861\\\"}\",\"{\\\"id\\\":\\\"B08FV23D3F\\\"}\",\"{\\\"id\\\":\\\"B08FNMP9YY\\\"}\",\"{\\\"id\\\":\\\"1671590902\\\"}\",\"{\\\"id\\\":\\\"1641527692\\\"}\",\"{\\\"id\\\":\\\"1628603917\\\"}\",\"{\\\"id\\\":\\\"B07ZHPQBVZ\\\"}\",\"{\\\"id\\\":\\\"B08Y49Y63B\\\"}\",\"{\\\"id\\\":\\\"B08T2QRSN3\\\"}\",\"{\\\"id\\\":\\\"1729392164\\\"}\",\
"{\\\"id\\\":\\\"B08T46R6XC\\\"}\",\"{\\\"id\\\":\\\"B08RRF5V1D\\\"}\",\"{\\\"id\\\":\\\"1592339727\\\"}\",\"{\\\"id\\\":\\\"1628602929\\\"}\",\"{\\\"id\\\":\\\"1984857088\\\"}\",\"{\\\"id\\\":\\\"0316529583\\\"}\",\"{\\\"id\\\":\\\"1641524820\\\"}\",\"{\\\"id\\\":\\\"1628602635\\\"}\",\"{\\\"id\\\":\\\"B00GRIR87M\\\"}\",\"{\\\"id\\\":\\\"B08FBHN5H7\\\"}\",\"{\\\"id\\\":\\\"B06ZYSS7HS\\\"}\"]},\"autoAdjustHeightFreescroll\":true,\"first_item_flush_left\":false,\"initThreshold\":100,\"loadingThresholdPixels\":100,\"name\":\"p13n-sc-shoveler_n1in5tlbg2h\",\"nextRequestSize\":6,\"set_size\":65}","faceoutspecs":"{}","faceoutkataname":"GeneralFaceout","individuals":"0","language":"en-US","linkparameters":"{\"pd_rd_w\":\"eouzj\",\"pf_rd_p\":\"45451e33-456f-46b5-8f06-aedad504c3d0\",\"pf_rd_r\":\"6Q3MPZHQQ2ESWZND1K8T\",\"pd_rd_r\":\"e5e43c03-d78d-41d3-9064-87af93f9856b\",\"pd_rd_wg\":\"PdhmI\"}","marketplaceid":"ATVPDKIKX0DER","name":"p13n-sc-shoveler_n1in5tlbg2h","offset":"6","reftagprefix":"pd_sim","aDisplayStrategy":"swap","aTransitionStrategy":"swap","aAjaxStrategy":"promise","ids":["{\"id\":\"B07SPSXHZ8\"}","{\"id\":\"B08GG2RL1D\"}","{\"id\":\"1507212305\"}","{\"id\":\"B08QFBMSFT\"}","{\"id\":\"164152247X\"}","{\"id\":\"1673455980\"}","{\"id\":\"B084DD8WHP\"}","{\"id\":\"1706342667\"}","{\"id\":\"1628603135\"}"],"indexes":[6,7,8,9,10,11,12,13,14]}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    # for product_link in product_links:
    s.headers['x-amz-acp-params'] = "tok=0DV5j8DDJsH8JQfdVFxJFD3p6AZraMOZTik-kgzNi08;ts=1619674837835;rid=ER1GSMM13VTETPS90K43;d1=251;d2=0;tpm=CGHBD;ref=rtpb"
    res = s.post(url,json=payload)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("li.a-carousel-card-fragment > a.a-link-normal > div[data-rows]"):
        print(item.text)

How can I scrape all the books from customers who bought section without hardcoding payload?

robots.txt asked Apr 29 '21 06:04





1 Answer

Everything you need to get the carousel data is in the initial response when you request the product URL.

You need to get the full product HTML, scoop out the carousel data, and reuse parts of it to construct a valid payload for the follow-up POST requests.

However, getting the product HTML is the hardest part, at least on my end, as Amazon will either block the request or throw a CAPTCHA if you fetch the HTML too often.

Using a proxy or VPN helps. Swapping the product URL sometimes helps too.

Summing up, the key is to get the product HTML. The subsequent requests are easy to make and are not throttled, AFAIK.
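Since the product HTML fetch is the fragile step, it can also help to detect the block page and back off before retrying. Here's a minimal sketch of that idea; the helper name is my own, and the marker string is an assumption based on what Amazon's CAPTCHA interstitial typically contains:

```python
import time

# Retry a page fetch with exponential back-off when the response looks
# like Amazon's CAPTCHA/block interstitial. `fetch` is any zero-argument
# callable returning a response-like object with a `.text` attribute,
# e.g. functools.partial(session.get, product_url, headers=headers).
def fetch_with_backoff(fetch, max_tries=4, base_delay=1.0):
    for attempt in range(max_tries):
        response = fetch()
        # Assumed marker: Amazon's block page usually mentions this address.
        if "api-services-support@amazon.com" not in response.text:
            return response  # looks like a real product page
        time.sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError("still blocked after %d tries" % max_tries)
```

Rotating the User-Agent or proxy between attempts, as mentioned above, fits naturally inside the `fetch` callable.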

Here's how to get the data for and from the carousel:

import json
import re

import requests
from bs4 import BeautifulSoup


# `chunk` is how many carousel items are requested per call;
# on the web page this varies from 4 to 10 items.
# The second list yielded is used as the "indexes" key in the payload.
def get_idx_and_indexes(carousel_ids: list, chunk: int = 5):
    for index in range(0, len(carousel_ids), chunk):
        tmp = carousel_ids[index:index + chunk]
        # range() rather than list.index(), so duplicate ids keep their positions
        yield tmp, list(range(index, index + len(tmp)))


headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.93 Safari/537.36",
}

product_url = 'https://www.amazon.de/Rust-Programming-Language-Covers-2018/dp/1718500440/'
# Getting the product HTML, as it carries all the carousel data items.
# The session is kept open (no `with` block) because it is reused for
# the POST requests to the carousel endpoint further down.
session = requests.Session()
session.get("https://www.amazon.com", headers=headers)  # warm-up: pick up session cookies
page = session.get(product_url, headers=headers)

# This is where the carousel data sits along with all the items needed to make
# the following requests e.g. items, acp-params, linkparameters, marketplaceid etc.
initial_soup = BeautifulSoup(
    re.search(r"<!--CardsClient-->(.*)<input", page.text).group(1),
    "lxml",
).find_all("div")

# Preparing all the details for subsequent requests to carousel_endpoint
item_ids = json.loads(initial_soup[3]["data-a-carousel-options"])["ajax"]["id_list"]
payload = {
    "aAjaxStrategy": "promise",
    "aCarouselOptions": initial_soup[3]["data-a-carousel-options"],
    "aDisplayStrategy": "swap",
    "aTransitionStrategy": "swap",
    "faceoutkataname": "GeneralFaceout",
    "faceoutspecs": "{}",
    "individuals": "0",
    "language": "en-US",
    "linkparameters": initial_soup[0]["data-acp-tracking"],
    "marketplaceid": initial_soup[3]["data-marketplaceid"],
    "name": "p13n-sc-shoveler_hgm4oj1hneo",  # this changes per page but can be ignored
    "offset": "6",
    "reftagprefix": "pd_sim",
}

headers.update(
    {
        "x-amz-acp-params": initial_soup[0]["data-acp-params"],
        "x-requested-with": "XMLHttpRequest",
    }
)

# looping through the carousel data and performing requests
carousel_endpoint = "https://www.amazon.com/acp/p13n-desktop-carousel/funjjvdbohwkuezi/getCarouselItems"
for ids, indexes in get_idx_and_indexes(item_ids):
    payload["ids"] = ids
    payload["indexes"] = indexes
    # The actual carousel data
    response = session.post(carousel_endpoint, json=payload, headers=headers)
    carousel = BeautifulSoup(response.text, "lxml").find_all("a")
    print("\n".join(a.getText() for a in carousel))

This should output:

Cracking the Coding Interview: 189 Programming Questions and Solutions
Gayle Laakmann McDowell
4.7 out of 5 stars 4,864
#1 Best Seller in Computer Hacking
$24.00

Container Security: Fundamental Technology Concepts that Protect Containerized Applications
Liz Rice
4.7 out of 5 stars 102
$35.42

Linux Bible
Christopher Negus
4.8 out of 5 stars 245
#1 Best Seller in Linux Servers
$31.99

System Design Interview – An insider's guide, Second Edition
Alex Xu
4.5 out of 5 stars 568
#1 Best Seller in Bioinformatics
$24.99

Ansible for DevOps: Server and configuration management for humans
Jeff Geerling
4.6 out of 5 stars 127
$17.35

Effective C: An Introduction to Professional C Programming
Robert C. Seacord
4.5 out of 5 stars 94
$32.99

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Aurélien Géron
4.8 out of 5 stars 1,954
#1 Best Seller in Computer Neural Networks
$32.93

Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software
Eric Freeman
4.7 out of 5 stars 67
$41.45

Fluent Python: Clear, Concise, and Effective Programming
Luciano Ramalho
4.6 out of 5 stars 523
54 offers from $32.24

TCP/IP Illustrated, Volume 1: The Protocols (Addison-Wesley Professional Computing Series)
4.6 out of 5 stars 199
$63.26

Operating Systems: Three Easy Pieces
4.7 out of 5 stars 224
#1 Best Seller in Computer Operating Systems Theory
$24.61

Software Engineering at Google: Lessons Learned from Programming Over Time
Titus Winters
4.6 out of 5 stars 243
$44.52

and so on ...
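As an aside, the chunking idea used by get_idx_and_indexes above can be sanity-checked in isolation. Here's a standalone copy of the same slicing logic (the function name is my own, and plain strings stand in for the carousel's '{"id": ...}' entries):

```python
# Standalone version of the chunking idea: slice the id list into
# chunk-sized pieces and pair each piece with the positions it came from.
def chunk_with_indexes(items, chunk=5):
    for start in range(0, len(items), chunk):
        piece = items[start:start + chunk]
        yield piece, list(range(start, start + len(piece)))

pairs = list(chunk_with_indexes(["a", "b", "c", "d", "e", "f", "g"], chunk=3))
# pairs == [(["a", "b", "c"], [0, 1, 2]),
#           (["d", "e", "f"], [3, 4, 5]),
#           (["g"], [6])]
```

Each (piece, positions) pair maps directly onto the "ids" and "indexes" keys of the POST payload.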
baduker answered Nov 06 '22 00:11