
Get title of a page with cheerio

I'm trying to get the title tag of a URL with cheerio, but I'm getting empty string values. This is my code:

app.get('/scrape', function(req, res){

    url = 'http://nrabinowitz.github.io/pjscrape/';

    request(url, function(error, response, html){
        if(!error){
            var $ = cheerio.load(html);

            var title, release, rating;
            var json = { title : "", release : "", rating : ""};

            $('title').filter(function(){
                //var data = $(this);
                var data = $(this);
                title = data.children().first().text();
                release = data.children().last().children().text();

                json.title = title;
                json.release = release;
            })

            $('.star-box-giga-star').filter(function(){
                var data = $(this);
                rating = data.text();

                json.rating = rating;
            })
        }


        fs.writeFile('output.json', JSON.stringify(json, null, 4), function(err){

            console.log('File successfully written! - Check your project directory for the output.json file');

        })

        // Finally, we'll just send out a message to the browser reminding you that this app does not have a UI.
        res.send('Check your console!')
    })
});
Filipe Ferminiano asked Apr 27 '14 17:04



2 Answers

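var request = require('request');
var cheerio = require('cheerio');

var url = 'http://nrabinowitz.github.io/pjscrape/';

request(url, function (error, response, body) {
  if (!error && response.statusCode === 200) {
    // Parse the HTML and read the text of the <title> element directly.
    var $ = cheerio.load(body);
    var title = $('title').text();
    console.log(title);
  }
});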

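This reads the text contained within the "title" tags directly. The code in the question calls .children() on the title element, but a title element contains only text and has no child elements, so that lookup returns an empty string.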

Robert Ryan answered Oct 12 '22 22:10


If Robert Ryan's solution still doesn't work, I'd be suspicious of the formatting of the original page, which may be malformed somehow.

In my case, I was accepting gzip and other compressed encodings but never decoding the response, so cheerio was trying to parse compressed binary data. Logging the raw body to the console showed binary garbage instead of plain-text HTML.
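
A minimal sketch of that check, using only Node's built-in zlib module. The helper names looksGzipped and decodeBody are hypothetical and not part of cheerio or request, and the compressed page is simulated locally instead of being fetched. It assumes the body is available as a raw Buffer and handles gzip only, not other encodings.

```javascript
const zlib = require('zlib');

// Hypothetical helper: true when the buffer starts with the gzip
// magic bytes 0x1f 0x8b.
function looksGzipped(buffer) {
  return buffer.length >= 2 && buffer[0] === 0x1f && buffer[1] === 0x8b;
}

// Hypothetical helper: turn a raw response body into an HTML string,
// decompressing first when the body is gzip-compressed.
function decodeBody(buffer) {
  const raw = looksGzipped(buffer) ? zlib.gunzipSync(buffer) : buffer;
  return raw.toString('utf8');
}

// Simulate a server response that was sent gzip-compressed.
const page = '<html><head><title>Example page</title></head><body></body></html>';
const compressed = zlib.gzipSync(Buffer.from(page, 'utf8'));

// The decoded string is what should be handed to cheerio.load().
const html = decodeBody(compressed);
console.log(html === page);
```

With the request library, the body should arrive as a Buffer when encoding: null is set, and as far as I know the gzip: true option asks request to decode compressed responses itself, which would make a manual step like this unnecessary.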

David Calhoun answered Oct 13 '22 00:10