How to scrape pages which have lazy loading

Here is the code I used to parse a web page. I ran it in the Rails console, but I get no output. The site I want to scrape uses lazy loading.

require 'nokogiri'
require 'open-uri'

page = 1
while true
  url = "http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=#{page}"

  doc = Nokogiri::HTML(open(url))
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)
  d = doc.css(".rslwrp")
  d.each do |t|
     puts t.css(".jrcw").text
     puts t.css("span.jcn").text
     puts t.css(".jaid").text
     puts t.css(".estd").text
    page+=1
  end
end
asked Sep 11 '15 by Rajesh Choudhary


1 Answer

You have two options here:

  1. Switch from pure HTTP scraping to a tool that supports JavaScript evaluation, such as Capybara (with a suitable driver selected). This can be slow, since you're running a headless browser under the hood, and you'll have to set timeouts or find another way to make sure the blocks of text you're interested in have loaded before you start scraping.

  2. Use the browser's Web Developer console to figure out how those blocks of text are loaded (which AJAX calls are made, with which parameters) and replicate those calls in your scraper. This is a more advanced approach, but more performant, since you skip all the extra work described in option 1.
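Option 2 boils down to a loop over the site's own pagination endpoint. Here is a minimal sketch of that loop; `collect_pages` and the injected `fetch` callable are illustrative names I made up (not part of the site's API), and the `error`/`lastPageNum`/`markup` fields mirror the JSON response shown further down. The fetcher is injected so the loop can be exercised without network access:

```ruby
require 'json'

# Hypothetical sketch: walk the AJAX pagination until the last page.
# `fetch` is any callable that takes a page number and returns the raw
# JSON body; in real use it would wrap an HTTP call to ajxsearch.php.
def collect_pages(fetch)
  results = []
  page = 1
  loop do
    data = JSON.parse(fetch.call(page))
    break if data['error'] != 0          # bail out on an API-level error
    results << data['markup']            # HTML fragment for this page
    break if page >= data['lastPageNum'].to_i
    page += 1
  end
  results
end

# Stubbed two-page response, just for illustration:
stub = lambda do |page|
  { 'error' => 0, 'lastPageNum' => 2, 'markup' => "<div>page #{page}</div>" }.to_json
end
puts collect_pages(stub).length # => 2
```

Each collected `markup` fragment can then be handed to Nokogiri exactly as in the snippets below.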

Have a nice day!

UPDATE:

Your code above doesn't work because the response is HTML wrapped in a JSON object, while you're trying to parse it as raw HTML. The response looks like this:

{
  "error": 0,
  "msg": "request successful",
  "paidDocIds": "some ids here",
  "itemStartIndex": 20,
  "lastPageNum": 50,
  "markup": "LOTS AND LOTS AND LOTS OF MARKUP"
}

What you need is to unwrap the JSON first, then parse the markup as HTML:

require 'json'
require 'open-uri'
require 'nokogiri'

json = JSON.parse(open(url).read) # make sure you check HTTP errors here
html = json['markup'] # can this field be empty? check the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like

I'd also advise against using open-uri: because of the way it works, your code can become vulnerable when URLs are built dynamically (read the linked article for the details). Prefer solid, more feature-rich libraries such as HTTParty or RestClient.
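Independently of which HTTP client you pick, building the query string with the standard library instead of string concatenation avoids escaping mistakes. A minimal sketch using stdlib `URI.encode_www_form` (the parameter names are taken from the URL in the question):

```ruby
require 'uri'

# Build the ajxsearch.php URL from a params hash; encode_www_form
# escapes each value (space -> '+', '/' -> '%2F', etc.) for us.
params = {
  national_search: 0, act: 'pagination', city: 'Delhi / NCR',
  search: 'Pandits', where: 'Delhi Cantt', catid: 1195,
  psearch: '', prid: '', page: 2
}
url = 'http://www.justdial.com/functions/ajxsearch.php?' + URI.encode_www_form(params)
puts url
```

This produces the same `city=Delhi+%2F+NCR` encoding seen in the hand-built URL above, without having to escape anything by hand.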

UPDATE 2: Minimal working script for me:

require 'json'
require 'open-uri'
require 'nokogiri'

url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'

json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
puts doc.at_css('#newphoto10').attr('title')
# => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi
answered Oct 08 '22 by Alexey Shein