
Getting the page title from a scraped webpage [closed]

Tags: node.js

var http = require('http');
var urlOpts = {host: 'www.nodejs.org', path: '/', port: '80'};
http.get(urlOpts, function (response) {
    response.on('data', function (chunk) {
        var str = chunk.toString();
        var re = new RegExp("(<\s*title[^>]*>(.+?)<\s*/\s*title)\>", "g");
        console.log(str.match(re));
    });
});

Output

user@dev ~ $ node app.js
[ 'node.js' ]
null
null

I only need to get the title.

user1777212 asked Oct 26 '12 13:10


2 Answers

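I would suggest using RegExp.exec instead of String.match. When the regular expression has the g flag, String.match returns only the full matches and drops the capture groups, whereas exec returns the capture groups as well. You can also define the regular expression using the literal syntax, and only once: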

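var http = require('http');
var urlOpts = {host: 'www.nodejs.org', path: '/', port: '80'};
// Defined once, with literal syntax. No g flag: exec on a g-flagged regex
// remembers lastIndex, which would carry over from one chunk to the next.
var re = /(<\s*title[^>]*>(.+?)<\s*\/\s*title)>/i;
http.get(urlOpts, function (response) {
    response.on('data', function (chunk) {
        var str = chunk.toString();
        var match = re.exec(str);
        // match[2] is the second capture group: the text between the tags
        if (match && match[2]) {
            console.log(match[2]);
        }
    });
});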
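
To make the difference between the two calls concrete, here is a small sketch that runs both on a hypothetical sample string (the html value is made up for illustration and is not fetched from any page):

```javascript
// Hypothetical sample input; not fetched from any real page.
var html = '<html><head><title>node.js</title></head></html>';

// With the g flag, String.match returns the full matches only,
// so the capture group holding the title text is lost.
var viaMatch = html.match(/<\s*title[^>]*>(.+?)<\s*\/\s*title\s*>/gi);

// RegExp.exec returns the full match followed by the capture groups.
var viaExec = /<\s*title[^>]*>(.+?)<\s*\/\s*title\s*>/i.exec(html);

console.log(viaMatch);   // [ '<title>node.js</title>' ]
console.log(viaExec[1]); // node.js
```

Dropping the g flag from the String.match call would also return the capture groups, so either approach should work as long as the flag and the method agree.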

Note that the http.get handler above assumes the whole title arrives in a single chunk rather than being split across two. It would be safer to accumulate the chunks as they arrive and search the accumulated text, in case the title is split between chunks. You may also want to stop looking for the title once you've found it.
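
One way to aggregate the chunks and stop early might look like the sketch below. The names createTitleFinder and fetchTitle are hypothetical helpers introduced for illustration, not part of Node's API. The pattern uses [\s\S]*? instead of .+? so that a title spanning several lines still matches, which is an assumption beyond the original code.

```javascript
var http = require('http');

// Hypothetical helper: returns a function that accumulates chunks and
// yields the title text once a complete <title>...</title> pair is buffered.
function createTitleFinder() {
    var re = /<\s*title[^>]*>([\s\S]*?)<\s*\/\s*title\s*>/i;
    var buffer = '';
    return function feed(chunk) {
        buffer += chunk.toString();
        var match = re.exec(buffer);
        return match ? match[1].trim() : null;
    };
}

// Hypothetical wrapper around http.get; calls back at most once
// with (err, title), where title is null if none was found.
function fetchTitle(urlOpts, callback) {
    var feed = createTitleFinder();
    var done = false;
    function finish(err, title) {
        if (done) { return; }
        done = true;
        callback(err, title);
    }
    var req = http.get(urlOpts, function (response) {
        // Decode as UTF-8 so multi-byte characters are not split across chunks.
        response.setEncoding('utf8');
        response.on('data', function (chunk) {
            var title = feed(chunk);
            if (title !== null) {
                finish(null, title);
                response.destroy(); // stop downloading the rest of the page
            }
        });
        response.on('end', function () { finish(null, null); });
    });
    req.on('error', function (err) { finish(err); });
}
```

With the options from the question, a call would be fetchTitle(urlOpts, function (err, title) { console.log(err || title); }). Searching the whole buffer on every chunk is simple rather than efficient, which should be acceptable because the title normally appears near the top of the page.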

bdukes answered Oct 20 '22 08:10


Try this:

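var re = new RegExp("<title>(.*?)</title>", "i");
var match = str.match(re);            // no g flag, so capture groups are returned
console.log(match ? match[1] : null); // match is null when str has no title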
gradosevic answered Oct 20 '22 10:10