
How to scrape a web page with dynamic content added by JavaScript?

I am trying to scrape this web page, which lazy-loads more content as you scroll. Using Nokogiri, I can scrape the initial page, but not the content that loads after scrolling.

shamshul2007 asked Sep 07 '13 07:09



1 Answer

To get the lazy-loaded content, scrape the following URLs directly. They differ only in the start parameter:

http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=31&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=46&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=61&ajax=true
...
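
The query strings above are percent-encoded, which hides how little changes between pages. As a rough sketch, the same URLs can be rebuilt from decoded parameters with the standard library's URI.encode_www_form. The helper name page_url and the constant BASE_URL are made up for illustration; the parameter values are simply the decoded forms of what appears in the URLs above.

```ruby
require 'uri'

# Listing path copied from the URLs above.
BASE_URL = "http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr"

# page_url is a hypothetical helper name, not part of any library.
# It percent-encodes the decoded query parameters and appends them to BASE_URL.
def page_url(start)
  query = URI.encode_www_form(
    "p[]"   => "sort=popularity",
    "sid"   => "osp,cil,nit,e1f",
    "start" => start,
    "ajax"  => "true"
  )
  "#{BASE_URL}?#{query}"
end

puts page_url(31)
puts page_url(46)
```

Keeping the parameters decoded makes it easier to see that only start moves from one request to the next.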

require 'nokogiri'
require 'open-uri'

number = 1
while true
  url = "http://www.flipkart.com/mens-footwear/shoes" +
        "/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&" +
        "sid=osp%2Ccil%2Cnit%2Ce1f&start=#{number}&ajax=true"

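  # URI.open comes from open-uri; Kernel#open no longer accepts URLs as of Ruby 3.0.
  doc = Nokogiri::HTML(URI.open(url))
  # The product markup arrives as text inside the #ajax element, so parse it again.
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)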

  products = doc.css(".browse-product")
  break if products.size == 0

  products.each do |item|
    # at_css returns nil when the element is missing; &. avoids calling .text on nil.
    title = item.at_css(".fk-display-block,.title")&.text.to_s.strip
    price = item.at_css(".pu-final")&.text.to_s.strip
    link = item.at_xpath(".//a[@class='fk-display-block']/@href")
    image = item.at_xpath(".//div/a/img/@src")

    puts number
    puts "#{title} - #{price}"
    puts "http://www.flipkart.com#{link}"
    puts image
    puts "========================"

    number += 1
  end
end
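
The loop above stops when a page comes back with no products and advances start by the number of products seen. If you want to check that stop condition without hitting the network, one option is to pass the fetching step in as a callable. This is only a sketch: each_product, fetch_page, and the fake data are hypothetical names and values, not part of Nokogiri or the site.

```ruby
# each_product and fetch_page are hypothetical names used for illustration.
# fetch_page is any callable that takes a start offset and returns an array
# of products for that page (an empty array once the listing is exhausted).
def each_product(fetch_page, first: 1)
  start = first
  loop do
    products = fetch_page.call(start)
    break if products.empty?

    products.each { |product| yield product }
    start += products.size
  end
end

# Stand-in for the network: two fake pages keyed by start offset;
# any other offset returns an empty page.
FAKE_PAGES = { 1 => %w[a b c], 4 => %w[d e] }.freeze
fake_fetch = ->(start) { FAKE_PAGES.fetch(start, []) }

seen = []
each_product(fake_fetch) { |product| seen << product }
p seen
```

In real use, the callable would wrap the URI.open and Nokogiri parsing steps and return doc.css(".browse-product").to_a.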
falsetru answered Sep 20 '22 21:09
