
Scraping dynamic content using python-Scrapy

Disclaimer: I've seen numerous similar posts on StackOverflow and tried the same approaches, but they don't seem to work on this website.

I'm using Python-Scrapy for getting data from koovs.com.

However, I'm not able to get the product size, which is dynamically generated. Specifically, if someone could guide me a little on getting the 'Not available' size tag from the drop-down menu on this link, I'd be grateful.

I am able to get the size list statically, but that way I only get the list of sizes, not which of them are available.

Pravesh Jain asked May 20 '15 09:05




1 Answer

You can also solve it with ScrapyJS (no need for Selenium or a real browser):

This library provides Scrapy+JavaScript integration using Splash.

Follow the installation instructions for Splash and ScrapyJS, then start the Splash Docker container:

$ docker run -p 8050:8050 scrapinghub/splash 
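Once the container is running, it can help to confirm that Splash renders the page before wiring it into Scrapy. The sketch below only builds the URL for Splash's render.html HTTP endpoint, using the `url` and `wait` parameters. The helper name `splash_render_url` and the `http://localhost:8050` base address are assumptions for illustration, not part of the original answer, and the code is written for Python 3.

```python
from urllib.parse import urlencode


def splash_render_url(splash_base, page_url, wait=0.5):
    """Build a URL for Splash's render.html endpoint (hypothetical helper).

    splash_base is wherever the container is reachable; the default Docker
    port mapping above exposes it on port 8050 of the Docker host.
    """
    # urlencode percent-escapes the target URL so that its own query
    # string does not get mixed up with Splash's parameters.
    query = urlencode({"url": page_url, "wait": wait})
    return "{}/render.html?{}".format(splash_base.rstrip("/"), query)


# Assumed address; replace it with the host where Splash actually listens.
check_url = splash_render_url("http://localhost:8050", "http://www.koovs.com/")
```

Opening `check_url` in a browser, or fetching it with any HTTP client, should return the page HTML after JavaScript has run, provided Splash is reachable at that address.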

Put the following settings into settings.py:

SPLASH_URL = 'http://192.168.59.103:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

And here is your sample spider that is able to see the size availability information:

# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["koovs.com"]
    start_urls = (
        'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    )

    def start_requests(self):
        # Route every request through Splash so the page's JavaScript runs
        # before the response reaches parse().
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        # [1:] skips the first <option> in the size drop-down.
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            print(option.xpath("text()").extract())

Here is what is printed on the console:

[u'S / 34 -- Not Available']
[u'L / 40 -- Not Available']
[u'L / 42']
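To turn these labels into structured availability data, the text of each option can be split on the "-- Not Available" suffix. This is a minimal sketch that assumes every unavailable size carries exactly that suffix, as in the three lines printed above; the site may use other wording for other products. The function name `parse_size_option` is hypothetical, and the code is written for Python 3.

```python
NOT_AVAILABLE_MARKER = "-- Not Available"  # assumed from the printed output


def parse_size_option(label):
    """Split an option label into a (size, available) pair."""
    text = label.strip()
    if text.endswith(NOT_AVAILABLE_MARKER):
        # Drop the marker and the whitespace that precedes it.
        return text[:-len(NOT_AVAILABLE_MARKER)].strip(), False
    return text, True


# The labels shown in the console output above.
labels = ["S / 34 -- Not Available", "L / 40 -- Not Available", "L / 42"]
sizes = [parse_size_option(label) for label in labels]
# sizes == [("S / 34", False), ("L / 40", False), ("L / 42", True)]
```

Inside the spider, the same function could be applied to each string returned by `option.xpath("text()").extract()` and the result yielded as an item instead of printed.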
alecxe answered Sep 21 '22 18:09