Disclaimer: I've seen numerous similar posts on StackOverflow and tried the same approaches, but they don't seem to work on this website.
I'm using Python-Scrapy for getting data from koovs.com.
However, I'm not able to get the product size, which is dynamically generated. Specifically, if someone could guide me a little on getting the 'Not available' size tag from the drop-down menu on this link, I'd be grateful.
I am able to get the size list statically, but that way I only get the list of sizes, not which of them are available.
Getting started. After installing Scrapy, choose a directory on your computer for the project, open a terminal there, and run scrapy startproject [name of project], which creates the Scrapy project. Once the project has been created, change into its directory.
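For example, assuming the project is called koovs (a name chosen here purely for illustration):
$ scrapy startproject koovs
$ cd koovs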
There are two approaches to scraping a dynamic webpage:
1. Scrape the content directly from the JavaScript, i.e. from the data source the page's scripts load.
2. Scrape the website as we view it in our browser, using Python packages capable of executing the JavaScript.
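As a sketch of the first approach: dynamically rendered pages usually fetch their data from an underlying JSON endpoint, which you can call directly. The endpoint URL and field names below are invented placeholders, not koovs.com's actual API; the real request can be found in the browser's developer tools network tab.

import requests  # third-party HTTP library, assumed to be installed

# Hypothetical endpoint and fields -- replace with what the network tab shows.
API_URL = 'http://www.example.com/api/product/236376/sizes'

data = requests.get(API_URL).json()
for size in data.get('sizes', []):  # 'sizes', 'label', 'available' are assumed names
    print(size.get('label'), size.get('available'))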
That is not really possible with BeautifulSoup alone, because BeautifulSoup is just an HTML parser and does not execute JavaScript. In those scenarios it is better to use Selenium to pull the dynamic content, for example as sketched below.
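A minimal sketch of that idea, assuming Selenium and chromedriver are installed; the URL and CSS selector are taken from this question and the answer below, the rest is illustrative:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = ('http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html'
       '?from=category-651&skuid=236376')

driver = webdriver.Chrome()  # requires chromedriver on your PATH
driver.get(url)
time.sleep(2)  # crude wait for the JavaScript to populate the size drop-down

# Hand the rendered page source (JavaScript already executed) to BeautifulSoup.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for option in soup.select('div.select-size select.sizeOptions option')[1:]:
    print(option.get_text(strip=True))

driver.quit()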
You can also solve it with ScrapyJS (no need for Selenium and a real browser):
This library provides Scrapy+JavaScript integration using Splash.
Follow the installation instructions for Splash and ScrapyJS, then start the Splash docker container:
$ docker run -p 8050:8050 scrapinghub/splash
Put the following settings into settings.py:
SPLASH_URL = 'http://192.168.59.103:8050'  # adjust to wherever your Splash instance is listening

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
And here is your sample spider that is able to see the size availability information:
# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["koovs.com"]
    start_urls = (
        'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    )

    def start_requests(self):
        # Route every request through Splash so the page's JavaScript is executed.
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        # Skip the first placeholder option, then print each size option's text.
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            print option.xpath("text()").extract()
Here is what is printed on the console:
[u'S / 34 -- Not Available']
[u'L / 40 -- Not Available']
[u'L / 42']
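If you want structured items rather than raw console output, a small variation on the parse method above (assuming a Scrapy version that lets spiders yield plain dicts; the field names are only illustrative) can split each option's text on the '--' marker:

    def parse(self, response):
        # Skip the first placeholder option, then yield one item per size.
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            text = option.xpath("text()").extract()[0].strip()
            yield {
                'size': text.split('--')[0].strip(),
                'available': 'Not Available' not in text,
            }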