I tried to get the HTML source of a web page, but the page contains some JavaScript code that generates data that I need.
require 'net/http'

http = Net::HTTP.new('localhost', 8080)
path = '/files.php'

# POST request -> logging in
data = ''
headers = {
  'Referer' => 'http://localhost:8080/files.php',
  'User-Agent' => 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0',
  'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language' => 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3',
  'Accept-Encoding' => 'gzip, deflate', # Accept-Encoding is the request header; Content-Encoding describes the body you send
  'Connection' => 'keep-alive',
  'Cookie' => ''
}

resp = http.post(path, data, headers)
puts resp.body
But this only returns the HTML without evaluating the JavaScript. I would like to get the final HTML after the page's JavaScript has been evaluated.
Doing scraping with JavaScript enabled is hard. Basically, you need to be able to fully emulate the browser if you want to do it reliably.
Fortunately, there are gems out there that do exactly that. You could use Capybara with a JavaScript-capable driver like Selenium. For example (adapted from this blog post):
require "capybara"
require "capybara/dsl"
Capybara.run_server = false
Capybara.current_driver = :selenium
Capybara.app_host = "http://www.google.com/"
class Scraper
include Capybara::DSL
def scrape
visit('/')
fill_in "q", :with => "Capybara"
click_button "Google Search"
all(:xpath, "//li[@class='g']/h3/a").each { |a| puts a[:href] }
end
end
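Since what you ultimately want is the final HTML after the JavaScript has run: inside a Capybara::DSL context, page.html returns the current DOM serialized as a string. A minimal sketch, to be added at the end of the scrape method above:

puts page.html # the DOM after JavaScript execution, not the raw server response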
There are alternative JavaScript-capable drivers out there if Selenium isn't your cup of tea (it literally automates your browser, e.g. Firefox, rather than implementing a separate "headless" browser of its own). See, for example, capybara-webkit or poltergeist for headless browser drivers.
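Switching drivers is mostly a one-line change. A minimal sketch using poltergeist, which drives the headless PhantomJS browser (this assumes the poltergeist gem is installed):

require "capybara"
require "capybara/dsl"
require "capybara/poltergeist"

Capybara.run_server = false
Capybara.current_driver = :poltergeist # headless PhantomJS instead of a visible Firefox
Capybara.app_host = "http://www.google.com/"

# The Scraper class above works unchanged with the new driver.
Scraper.new.scrape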