
Trouble scraping all the books from a section without hardcoding payload

I've created a script to scrape the names of the books listed under the Customers who bought this item also bought section on pages like these. Once you click the right arrow button, you can see all the related books. I've used two different book links within the script to see how it behaves.

The payload I use within the POST request is hardcoded and only works for the first link in product_links. The payload appears to be available in the page source, but I can't find the right way to extract it automatically. There are several ids within the payload which won't be identical when I use another book link, so hardcoding the payload doesn't seem like a good idea.

I've tried with:

import requests
from bs4 import BeautifulSoup

product_links = [
    'https://www.amazon.com/Essential-Keto-Diet-Beginners-2019/dp/1099697018/',
    'https://www.amazon.com/Keto-Cookbook-Beginners-Low-Carb-Homemade/dp/B08QFBMSFT/'
]

url = 'https://www.amazon.com/acp/p13n-desktop-carousel/funjjvdbohwkuezi/getCarouselItems'
payload = {"aCarouselOptions":"{\"ajax\":{\"id_list\":[\"{\\\"id\\\":\\\"B07NYZJX2L\\\"}\",\"{\\\"id\\\":\\\"1939754445\\\"}\",\"{\\\"id\\\":\\\"1792145454\\\"}\",\"{\\\"id\\\":\\\"1073560988\\\"}\",\"{\\\"id\\\":\\\"1119578922\\\"}\",\"{\\\"id\\\":\\\"B083K5RRSG\\\"}\",\"{\\\"id\\\":\\\"B07SPSXHZ8\\\"}\",\"{\\\"id\\\":\\\"B08GG2RL1D\\\"}\",\"{\\\"id\\\":\\\"1507212305\\\"}\",\"{\\\"id\\\":\\\"B08QFBMSFT\\\"}\",\"{\\\"id\\\":\\\"164152247X\\\"}\",\"{\\\"id\\\":\\\"1673455980\\\"}\",\"{\\\"id\\\":\\\"B084DD8WHP\\\"}\",\"{\\\"id\\\":\\\"1706342667\\\"}\",\"{\\\"id\\\":\\\"1628603135\\\"}\",\"{\\\"id\\\":\\\"B08NZV2Z4N\\\"}\",\"{\\\"id\\\":\\\"1942411294\\\"}\",\"{\\\"id\\\":\\\"1507209924\\\"}\",\"{\\\"id\\\":\\\"1641520434\\\"}\",\"{\\\"id\\\":\\\"B084Z7627Q\\\"}\",\"{\\\"id\\\":\\\"B08NRXFZ98\\\"}\",\"{\\\"id\\\":\\\"1623159326\\\"}\",\"{\\\"id\\\":\\\"B0827DHLR6\\\"}\",\"{\\\"id\\\":\\\"B08TL5W56Z\\\"}\",\"{\\\"id\\\":\\\"1941169171\\\"}\",\"{\\\"id\\\":\\\"1645670945\\\"}\",\"{\\\"id\\\":\\\"B08GLSSNKF\\\"}\",\"{\\\"id\\\":\\\"B08RR4RJHB\\\"}\",\"{\\\"id\\\":\\\"B07WRQ4CF4\\\"}\",\"{\\\"id\\\":\\\"B08Y49Z3V1\\\"}\",\"{\\\"id\\\":\\\"B08LNX32ZL\\\"}\",\"{\\\"id\\\":\\\"1250621097\\\"}\",\"{\\\"id\\\":\\\"1628600071\\\"}\",\"{\\\"id\\\":\\\"1646115511\\\"}\",\"{\\\"id\\\":\\\"1705799507\\\"}\",\"{\\\"id\\\":\\\"B08XZCM2P4\\\"}\",\"{\\\"id\\\":\\\"1072855267\\\"}\",\"{\\\"id\\\":\\\"B08VCMWPB9\\\"}\",\"{\\\"id\\\":\\\"1623159229\\\"}\",\"{\\\"id\\\":\\\"B08KH2J3FM\\\"}\",\"{\\\"id\\\":\\\"B08D54RBGP\\\"}\",\"{\\\"id\\\":\\\"1507212992\\\"}\",\"{\\\"id\\\":\\\"1635653894\\\"}\",\"{\\\"id\\\":\\\"B01MUB7BUV\\\"}\",\"{\\\"id\\\":\\\"0358120861\\\"}\",\"{\\\"id\\\":\\\"B08FV23D3F\\\"}\",\"{\\\"id\\\":\\\"B08FNMP9YY\\\"}\",\"{\\\"id\\\":\\\"1671590902\\\"}\",\"{\\\"id\\\":\\\"1641527692\\\"}\",\"{\\\"id\\\":\\\"1628603917\\\"}\",\"{\\\"id\\\":\\\"B07ZHPQBVZ\\\"}\",\"{\\\"id\\\":\\\"B08Y49Y63B\\\"}\",\"{\\\"id\\\":\\\"B08T2QRSN3\\\"}\",\"{\\\"id\\\":\\\"1729392164\\\"}\",\
"{\\\"id\\\":\\\"B08T46R6XC\\\"}\",\"{\\\"id\\\":\\\"B08RRF5V1D\\\"}\",\"{\\\"id\\\":\\\"1592339727\\\"}\",\"{\\\"id\\\":\\\"1628602929\\\"}\",\"{\\\"id\\\":\\\"1984857088\\\"}\",\"{\\\"id\\\":\\\"0316529583\\\"}\",\"{\\\"id\\\":\\\"1641524820\\\"}\",\"{\\\"id\\\":\\\"1628602635\\\"}\",\"{\\\"id\\\":\\\"B00GRIR87M\\\"}\",\"{\\\"id\\\":\\\"B08FBHN5H7\\\"}\",\"{\\\"id\\\":\\\"B06ZYSS7HS\\\"}\"]},\"autoAdjustHeightFreescroll\":true,\"first_item_flush_left\":false,\"initThreshold\":100,\"loadingThresholdPixels\":100,\"name\":\"p13n-sc-shoveler_n1in5tlbg2h\",\"nextRequestSize\":6,\"set_size\":65}","faceoutspecs":"{}","faceoutkataname":"GeneralFaceout","individuals":"0","language":"en-US","linkparameters":"{\"pd_rd_w\":\"eouzj\",\"pf_rd_p\":\"45451e33-456f-46b5-8f06-aedad504c3d0\",\"pf_rd_r\":\"6Q3MPZHQQ2ESWZND1K8T\",\"pd_rd_r\":\"e5e43c03-d78d-41d3-9064-87af93f9856b\",\"pd_rd_wg\":\"PdhmI\"}","marketplaceid":"ATVPDKIKX0DER","name":"p13n-sc-shoveler_n1in5tlbg2h","offset":"6","reftagprefix":"pd_sim","aDisplayStrategy":"swap","aTransitionStrategy":"swap","aAjaxStrategy":"promise","ids":["{\"id\":\"B07SPSXHZ8\"}","{\"id\":\"B08GG2RL1D\"}","{\"id\":\"1507212305\"}","{\"id\":\"B08QFBMSFT\"}","{\"id\":\"164152247X\"}","{\"id\":\"1673455980\"}","{\"id\":\"B084DD8WHP\"}","{\"id\":\"1706342667\"}","{\"id\":\"1628603135\"}"],"indexes":[6,7,8,9,10,11,12,13,14]}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    # for product_link in product_links:
    s.headers['x-amz-acp-params'] = "tok=0DV5j8DDJsH8JQfdVFxJFD3p6AZraMOZTik-kgzNi08;ts=1619674837835;rid=ER1GSMM13VTETPS90K43;d1=251;d2=0;tpm=CGHBD;ref=rtpb"
    res = s.post(url,json=payload)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("li.a-carousel-card-fragment > a.a-link-normal > div[data-rows]"):
        print(item.text)

How can I scrape all the books from customers who bought section without hardcoding payload?

robots.txt asked Apr 29 '21 06:04





1 Answer

Everything you need to get the carousel data is in the initial response when you request the product URL.

You need to get the full product HTML, scoop out the carousel data, and reuse parts of it to construct a valid payload for the follow-up POST requests.

However, getting the product HTML is the hardest part, at least on my end, as Amazon will either block the request or throw a CAPTCHA if you fetch the HTML too often.

Using a proxy or VPN helps. Swapping the product URL sometimes helps too.

Summing up, the key is to get the product HTML. The subsequent requests are easy to make and are not throttled, AFAIK.
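Since the product HTML fetch is the fragile step, it can also help to detect the block page and back off before retrying. Here's a minimal sketch of that idea; the helper name is my own, and the marker string is an assumption based on what Amazon's CAPTCHA interstitial typically contains:

```python
import time

# Retry a page fetch with exponential back-off when the response looks
# like Amazon's CAPTCHA/block interstitial. `fetch` is any zero-argument
# callable returning a response-like object with a `.text` attribute,
# e.g. functools.partial(session.get, product_url, headers=headers).
def fetch_with_backoff(fetch, max_tries=4, base_delay=1.0):
    for attempt in range(max_tries):
        response = fetch()
        # Assumed marker: Amazon's block page usually mentions this address.
        if "api-services-support@amazon.com" not in response.text:
            return response  # looks like a real product page
        time.sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError("still blocked after %d tries" % max_tries)
```

Rotating the User-Agent or proxy between attempts, as mentioned above, fits naturally inside the `fetch` callable.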

Here's how to get the data for and from the carousel:

import json
import re

import requests
from bs4 import BeautifulSoup


# `chunk` is how many carousel items are requested per call;
# on the web page this varies from 4 to 10 items.
# The second list yielded is used as the "indexes" key in the payload.
def get_idx_and_indexes(carousel_ids: list, chunk: int = 5):
    for index in range(0, len(carousel_ids), chunk):
        tmp = carousel_ids[index:index + chunk]
        # range() rather than list.index(), so duplicate ids keep their positions
        yield tmp, list(range(index, index + len(tmp)))


headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.93 Safari/537.36",
}

product_url = 'https://www.amazon.de/Rust-Programming-Language-Covers-2018/dp/1718500440/'
# Getting the product HTML, as it carries all the carousel data items.
# The session is kept open (no `with` block) because it is reused for
# the POST requests to the carousel endpoint further down.
session = requests.Session()
session.get("https://www.amazon.com", headers=headers)  # warm-up: pick up session cookies
page = session.get(product_url, headers=headers)

# This is where the carousel data sits along with all the items needed to make
# the following requests e.g. items, acp-params, linkparameters, marketplaceid etc.
initial_soup = BeautifulSoup(
    re.search(r"<!--CardsClient-->(.*)<input", page.text).group(1),
    "lxml",
).find_all("div")

# Preparing all the details for subsequent requests to carousel_endpoint
item_ids = json.loads(initial_soup[3]["data-a-carousel-options"])["ajax"]["id_list"]
payload = {
    "aAjaxStrategy": "promise",
    "aCarouselOptions": initial_soup[3]["data-a-carousel-options"],
    "aDisplayStrategy": "swap",
    "aTransitionStrategy": "swap",
    "faceoutkataname": "GeneralFaceout",
    "faceoutspecs": "{}",
    "individuals": "0",
    "language": "en-US",
    "linkparameters": initial_soup[0]["data-acp-tracking"],
    "marketplaceid": initial_soup[3]["data-marketplaceid"],
    "name": "p13n-sc-shoveler_hgm4oj1hneo",  # this changes per page but can be ignored
    "offset": "6",
    "reftagprefix": "pd_sim",
}

headers.update(
    {
        "x-amz-acp-params": initial_soup[0]["data-acp-params"],
        "x-requested-with": "XMLHttpRequest",
    }
)

# looping through the carousel data and performing requests
carousel_endpoint = "https://www.amazon.com/acp/p13n-desktop-carousel/funjjvdbohwkuezi/getCarouselItems"
for ids, indexes in get_idx_and_indexes(item_ids):
    payload["ids"] = ids
    payload["indexes"] = indexes
    # The actual carousel data
    response = session.post(carousel_endpoint, json=payload, headers=headers)
    carousel = BeautifulSoup(response.text, "lxml").find_all("a")
    print("\n".join(a.getText() for a in carousel))

This should output:

Cracking the Coding Interview: 189 Programming Questions and Solutions
Gayle Laakmann McDowell
4.7 out of 5 stars 4,864
#1 Best Seller in Computer Hacking
$24.00

Container Security: Fundamental Technology Concepts that Protect Containerized Applications
Liz Rice
4.7 out of 5 stars 102
$35.42

Linux Bible
Christopher Negus
4.8 out of 5 stars 245
#1 Best Seller in Linux Servers
$31.99

System Design Interview – An insider's guide, Second Edition
Alex Xu
4.5 out of 5 stars 568
#1 Best Seller in Bioinformatics
$24.99

Ansible for DevOps: Server and configuration management for humans
Jeff Geerling
4.6 out of 5 stars 127
$17.35

Effective C: An Introduction to Professional C Programming
Robert C. Seacord
4.5 out of 5 stars 94
$32.99

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Aurélien Géron
4.8 out of 5 stars 1,954
#1 Best Seller in Computer Neural Networks
$32.93

Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software
Eric Freeman
4.7 out of 5 stars 67
$41.45

Fluent Python: Clear, Concise, and Effective Programming
Luciano Ramalho
4.6 out of 5 stars 523
54 offers from $32.24

TCP/IP Illustrated, Volume 1: The Protocols (Addison-Wesley Professional Computing Series)
4.6 out of 5 stars 199
$63.26

Operating Systems: Three Easy Pieces
4.7 out of 5 stars 224
#1 Best Seller in Computer Operating Systems Theory
$24.61

Software Engineering at Google: Lessons Learned from Programming Over Time
Titus Winters
4.6 out of 5 stars 243
$44.52

and so on ...
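As an aside, the chunking idea used by get_idx_and_indexes above can be sanity-checked in isolation. Here's a standalone copy of the same slicing logic (the function name is my own, and plain strings stand in for the carousel's '{"id": ...}' entries):

```python
# Standalone version of the chunking idea: slice the id list into
# chunk-sized pieces and pair each piece with the positions it came from.
def chunk_with_indexes(items, chunk=5):
    for start in range(0, len(items), chunk):
        piece = items[start:start + chunk]
        yield piece, list(range(start, start + len(piece)))

pairs = list(chunk_with_indexes(["a", "b", "c", "d", "e", "f", "g"], chunk=3))
# pairs == [(["a", "b", "c"], [0, 1, 2]),
#           (["d", "e", "f"], [3, 4, 5]),
#           (["g"], [6])]
```

Each (piece, positions) pair maps directly onto the "ids" and "indexes" keys of the POST payload.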
baduker answered Nov 06 '22 00:11