Scraping dynamic content using python-Scrapy

Tags: python, web-scraping, scrapy

Disclaimer: I've seen numerous similar posts on StackOverflow and tried the same approaches, but they don't seem to work on this website.

I'm using Python-Scrapy for getting data from koovs.com.

However, I'm not able to get the product size, which is dynamically generated. Specifically, if someone could guide me a little on getting the 'Not available' size tag from the drop-down menu on this link, I'd be grateful.

I am able to get the size list from the static HTML, but that only gives me the sizes themselves, not which of them are available.

asked May 20 '15 09:05

Pravesh Jain

1 Answer

You can also solve it with ScrapyJS (no need for selenium and a real browser):

This library provides Scrapy+JavaScript integration using Splash.

Follow the installation instructions for Splash and ScrapyJS, then start the Splash Docker container:

$ docker run -p 8050:8050 scrapinghub/splash
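
Before wiring Splash into Scrapy, it can help to confirm that the container actually returns rendered HTML. The sketch below is not part of the original setup: it calls Splash's render.html HTTP endpoint directly with the standard library (Python 3). The helper names build_render_url and fetch_rendered_html are hypothetical, and the address http://localhost:8050 is an assumption; use whatever host your container is reachable on.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(splash_url, target_url, wait=0.5):
    """Return the render.html URL that asks Splash to render target_url."""
    # 'url' and 'wait' are the same arguments the spider passes via meta['splash']['args'].
    query = urlencode({"url": target_url, "wait": wait})
    return "{}/render.html?{}".format(splash_url.rstrip("/"), query)


def fetch_rendered_html(splash_url, target_url, wait=0.5, timeout=30):
    """Fetch the JavaScript-rendered HTML of target_url through Splash."""
    request_url = build_render_url(splash_url, target_url, wait)
    with urlopen(request_url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example call (needs a running Splash container, so it is left commented out):
# html = fetch_rendered_html("http://localhost:8050", "http://www.koovs.com/")
# print("Not Available" in html)
```

If the commented-out call returns HTML containing the availability text, Splash is rendering the page and any remaining problem is on the Scrapy side.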

Put the following settings into settings.py:

SPLASH_URL = 'http://192.168.59.103:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
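
Note that ScrapyJS was later renamed to scrapy-splash. If you install that package instead, the equivalent settings would look roughly like the sketch below. It follows the scrapy-splash README as I recall it rather than the original answer, so check it against the version you install. The SPLASH_URL value here is an assumption for a container published on the local machine.

```python
# settings.py fragment for the renamed package (scrapy-splash).
# Assumption: Splash is reachable on localhost; adjust to your Docker host.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```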

And here is your sample spider that is able to see the size availability information:

# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["koovs.com"]
    start_urls = (
        'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    )

    def start_requests(self):
        # Route each request through Splash so the page's JavaScript runs first.
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        # Skip the first <option> (assumed to be the drop-down's placeholder entry).
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            print(option.xpath("text()").extract())

Here is what is printed on the console:

[u'S / 34 -- Not Available']
[u'L / 40 -- Not Available']
[u'L / 42']
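
If you want structured data rather than raw labels, you could split each label into a size and an availability flag. This is a minimal sketch, assuming the labels always follow the " -- Not Available" suffix pattern shown in the output above; parse_size_option is a hypothetical helper, not part of Scrapy or ScrapyJS.

```python
def parse_size_option(text):
    """Split an option label into a (size, available) pair."""
    marker = "Not Available"
    label = text.strip()
    if label.endswith(marker):
        # Drop the marker, then the " -- " separator that precedes it.
        size = label[:-len(marker)].rstrip(" -")
        return size, False
    return label, True


# Labels copied from the console output above.
labels = ["S / 34 -- Not Available", "L / 40 -- Not Available", "L / 42"]
availability = [parse_size_option(label) for label in labels]
```

In the spider's parse method you could yield these pairs as items instead of printing the raw text.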

answered Sep 21 '22 18:09

alecxe
