How can I scrape pages with dynamic content using node.js?

I am trying to scrape a website, but I can't get some of the elements because they are created dynamically.

I am using cheerio in node.js, and my code is below.

var request = require('request');
var cheerio = require('cheerio');
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";

request(url, function (err, res, html) {
    var $ = cheerio.load(html);
    $('.listMain > li').each(function () {
        console.log($(this).find('a').attr('href'));
    });
});

This code returns an empty result, because at the time the page is loaded the <ul id="store_list" class="listMain"> element is empty.

The content has not been appended yet.

How can I get these elements using node.js? How can I scrape pages with dynamic content?

asked Feb 26 '15 by JayD


People also ask

Is Nodejs good for web scraping?

Web scraping is the process of extracting data from a website in an automated way, and Node.js can be used for it. Even though other languages and frameworks are more popular for web scraping, Node.js can handle the job well too.
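For context, here is a minimal sketch of the static case with cheerio alone; the HTML string and href are invented for illustration. This approach only sees markup that is already in the document, which is exactly why it fails on content generated client-side:

var cheerio = require('cheerio');

// Example markup invented for illustration: everything is already present
// in the HTML, so cheerio can select it directly.
var html = '<ul class="listMain"><li><a href="/item/1">Item 1</a></li></ul>';
var $ = cheerio.load(html);

$('.listMain > li').each(function () {
    console.log($(this).find('a').attr('href')); // "/item/1"
});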

How do you scrape a dynamic website from Scrapy?

Getting Started. After installing Scrapy, choose a directory on your computer for the project, open a terminal there, and run the command scrapy startproject [name of project], which creates the Scrapy project. Once the project has been created, you need to change into its directory.


2 Answers

Here you go:

var phantom = require('phantom');

phantom.create(function (ph) {
  ph.createPage(function (page) {
    var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
    page.open(url, function() {
      page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
        page.evaluate(function() {
          $('.listMain > li').each(function () {
            console.log($(this).find('a').attr('href'));
          });
        }, function(){
          ph.exit();
        });
      });
    });
  });
});
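One note on the snippet above: depending on the phantom version, console.log inside evaluate runs in the PhantomJS page context rather than in Node, so the output may not reach your terminal. A rough, untested sketch of one way around that, using the same callback-style API as above, is to return the hrefs from evaluate and log them in the callback:

var phantom = require('phantom');

phantom.create(function (ph) {
  ph.createPage(function (page) {
    var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
    page.open(url, function () {
      page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function () {
        page.evaluate(function () {
          // Runs inside the page: collect the hrefs into a plain array.
          return $('.listMain > li').map(function () {
            return $(this).find('a').attr('href');
          }).get();
        }, function (hrefs) {
          // Runs in Node with the value returned by the page function.
          console.log(hrefs);
          ph.exit();
        });
      });
    });
  });
});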
answered Sep 19 '22 by Safi


Check out GoogleChrome/puppeteer

Headless Chrome Node API

It makes scraping pretty trivial. The following example will scrape the headline over at npmjs.com (assuming .npm-expansions remains).

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://www.npmjs.com/');

  const textContent = await page.evaluate(() => {
    return document.querySelector('.npm-expansions').textContent
  });

  console.log(textContent); /* No Problem Mate */

  browser.close();
})();

evaluate runs the supplied function inside the page itself, so it can inspect elements that were rendered dynamically.
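If it helps, here is a rough, untested sketch of the same idea applied to the page from the question, using puppeteer's waitForSelector to wait for the dynamically appended list items (the .listMain selector is taken from the question) and $$eval to pull the hrefs back into Node:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('http://www.bdtong.co.kr/index.php?c_category=C02');

  // Wait until the dynamically appended list items are actually in the DOM.
  // Selector taken from the question; untested against the live site.
  await page.waitForSelector('.listMain > li a');

  // Runs in the page context; the resulting array is returned to Node.
  const hrefs = await page.$$eval('.listMain > li a', anchors =>
    anchors.map(a => a.href)
  );

  console.log(hrefs);

  await browser.close();
})();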

answered Sep 22 '22 by scniro