How to scrape pages which have lazy loading

Here is the code I used to parse a web page. I ran it in the Rails console, but I get no output. The site I want to scrape uses lazy loading.

require 'nokogiri'
require 'open-uri'

page = 1
while true
  url = "http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=#{page}"

  doc = Nokogiri::HTML(open(url))
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)
  d = doc.css(".rslwrp")
  d.each do |t|
     puts t.css(".jrcw").text
     puts t.css("span.jcn").text
     puts t.css(".jaid").text
     puts t.css(".estd").text
    page+=1
  end
end
asked Sep 11 '15 by Rajesh Choudhary


1 Answer

You have two options here:

  1. Switch from pure HTTP scraping to a tool that supports JavaScript evaluation, such as Capybara (with a suitable driver selected). This can be slow, since you're running a headless browser under the hood, and you'll have to set timeouts or find another way to make sure the blocks of text you're interested in have loaded before you start scraping.

  2. Use the browser's Web Developer console to figure out how those blocks of text are loaded (which AJAX calls are made, with which parameters) and replicate those calls in your scraper. This is a more advanced approach, but more performant, since you skip all the extra work described in option 1.
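Option 2 boils down to a loop over the site's own pagination endpoint. Here is a minimal sketch of that loop; `collect_pages` and the injected `fetch` callable are illustrative names I made up (not part of the site's API), and the `error`/`lastPageNum`/`markup` fields mirror the JSON response shown further down. The fetcher is injected so the loop can be exercised without network access:

```ruby
require 'json'

# Hypothetical sketch: walk the AJAX pagination until the last page.
# `fetch` is any callable that takes a page number and returns the raw
# JSON body; in real use it would wrap an HTTP call to ajxsearch.php.
def collect_pages(fetch)
  results = []
  page = 1
  loop do
    data = JSON.parse(fetch.call(page))
    break if data['error'] != 0          # bail out on an API-level error
    results << data['markup']            # HTML fragment for this page
    break if page >= data['lastPageNum'].to_i
    page += 1
  end
  results
end

# Stubbed two-page response, just for illustration:
stub = lambda do |page|
  { 'error' => 0, 'lastPageNum' => 2, 'markup' => "<div>page #{page}</div>" }.to_json
end
puts collect_pages(stub).length # => 2
```

Each collected `markup` fragment can then be handed to Nokogiri exactly as in the snippets below.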

Have a nice day!

UPDATE:

Your code above doesn't work because the response is HTML wrapped in a JSON object, while you're trying to parse it as raw HTML. The response looks like this:

{
  "error": 0,
  "msg": "request successful",
  "paidDocIds": "some ids here",
  "itemStartIndex": 20,
  "lastPageNum": 50,
  "markup": "LOTS AND LOTS AND LOTS OF MARKUP"
}

What you need is to unwrap the JSON first, then parse the markup as HTML:

require 'json'
require 'open-uri'
require 'nokogiri'

json = JSON.parse(open(url).read) # make sure you check HTTP errors here
html = json['markup'] # can this field be empty? check the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like

I'd also advise against using open-uri: because of the way it works, your code can become vulnerable when URLs are built dynamically (read the linked article for the details). Prefer solid, more feature-rich libraries such as HTTParty or RestClient.
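Independently of which HTTP client you pick, building the query string with the standard library instead of string concatenation avoids escaping mistakes. A minimal sketch using stdlib `URI.encode_www_form` (the parameter names are taken from the URL in the question):

```ruby
require 'uri'

# Build the ajxsearch.php URL from a params hash; encode_www_form
# escapes each value (space -> '+', '/' -> '%2F', etc.) for us.
params = {
  national_search: 0, act: 'pagination', city: 'Delhi / NCR',
  search: 'Pandits', where: 'Delhi Cantt', catid: 1195,
  psearch: '', prid: '', page: 2
}
url = 'http://www.justdial.com/functions/ajxsearch.php?' + URI.encode_www_form(params)
puts url
```

This produces the same `city=Delhi+%2F+NCR` encoding seen in the hand-built URL above, without having to escape anything by hand.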

UPDATE 2: Minimal working script for me:

require 'json'
require 'open-uri'
require 'nokogiri'

url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'

json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
puts doc.at_css('#newphoto10').attr('title')
# => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi
answered Oct 08 '22 by Alexey Shein