
Use scrapy to get list of urls, and then scrape content inside those urls

I need a Scrapy spider to scrape the following page (https://www.phidgets.com/?tier=1&catid=64&pcid=57) for each product URL (30 products, so 30 URLs) and then follow each of those URLs and scrape the data inside.

I have the second part working exactly as I want:

import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = [
        'https://www.phidgets.com/?tier=1&catid=64&pcid=57',
    ]

    def parse(self, response):
        for info in response.css('div.ph-product-container'):
            yield {
                'product_name': info.css('h2.ph-product-name::text').extract_first(),
                'product_image': info.css('div.ph-product-img-ctn a').xpath('@href').extract(),
                'sku': info.css('span.ph-pid').xpath('@prod-sku').extract_first(),
                'short_description': info.css('div.ph-product-summary::text').extract_first(),
                'price': info.css('h2.ph-product-price > span.price::text').extract_first(),
                'long_description': info.css('div#product_tab_1').extract_first(),
                'specs': info.css('div#product_tab_2').extract_first(),
            }

        # next_page = response.css('div.ph-summary-entry-ctn a::attr("href")').extract_first()
        # if next_page is not None:
        #     yield response.follow(next_page, self.parse)

But I don't know how to do the first part. As you can see, I have the main page (https://www.phidgets.com/?tier=1&catid=64&pcid=57) set as the start_url. But how do I get it to populate the start_urls list with all 30 URLs I need crawled?

Adriano C R asked Jul 04 '17


People also ask

How do you scrape data from a website using Scrapy?

When working with Scrapy, you first create a Scrapy project and then add a spider that fetches the data: move to the project's spiders folder and create a Python file there (for example gfgfetch.py).
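A minimal sketch of that workflow from the command line; the project and spider names below are placeholders, not anything taken from the question or answer:

scrapy startproject phidgets_demo          # create the project skeleton
cd phidgets_demo
scrapy genspider products phidgets.com     # creates phidgets_demo/spiders/products.py
scrapy crawl products                      # run the spider by its "name" attribute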

Which command is used to crawl the data from the website using Scrapy library?

You can fetch a web page with the fetch command in the Scrapy shell. A crawler (or spider) then goes through the page, downloading its text and metadata.
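For example, the Scrapy shell lets you try selectors interactively before putting them in a spider; the selector below is just the listing-page selector from the answer further down, assumed to still match that page:

scrapy shell 'https://www.phidgets.com/?tier=1&catid=64&pcid=57'
>>> response.xpath("//*[contains(@class, 'ph-summary-entry-ctn')]/a/@href").extract()
>>> fetch('https://www.phidgets.com/')   # load a different page into the shell's response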


1 Answer

I am not able to test at the moment, so please let me know if this works for you and I will edit it if there are any bugs.

The idea here is that we find every product link on the first page and yield new Scrapy requests, passing your product parsing method as the callback.

import scrapy
from urllib.parse import urljoin

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = [
        'https://www.phidgets.com/?tier=1&catid=64&pcid=57',
    ]

    def parse(self, response):
        # Collect the relative link of every product summary block on the listing page
        products = response.xpath("//*[contains(@class, 'ph-summary-entry-ctn')]/a/@href").extract()
        for p in products:
            # Build an absolute URL and request each product page,
            # handing the response to parse_product
            url = urljoin(response.url, p)
            yield scrapy.Request(url, callback=self.parse_product)

    def parse_product(self, response):
        # Same extraction logic as in the question, now run on each product page
        for info in response.css('div.ph-product-container'):
            yield {
                'product_name': info.css('h2.ph-product-name::text').extract_first(),
                'product_image': info.css('div.ph-product-img-ctn a').xpath('@href').extract(),
                'sku': info.css('span.ph-pid').xpath('@prod-sku').extract_first(),
                'short_description': info.css('div.ph-product-summary::text').extract_first(),
                'price': info.css('h2.ph-product-price > span.price::text').extract_first(),
                'long_description': info.css('div#product_tab_1').extract_first(),
                'specs': info.css('div#product_tab_2').extract_first(),
            }
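On newer Scrapy versions you can also skip the urljoin step, since response.follow (already used in the commented-out pagination code in the question) accepts relative URLs directly. A minimal, untested sketch of that variant of parse:

    def parse(self, response):
        for href in response.xpath("//*[contains(@class, 'ph-summary-entry-ctn')]/a/@href").extract():
            # response.follow resolves the relative href against response.url
            yield response.follow(href, callback=self.parse_product)

To run the spider and dump the items, something like scrapy runspider products_spider.py -o products.json should work (the file names here are placeholders).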
Henrique Coura answered Nov 15 '22