
How do I get the absolute path for '<img src=''>' in Node from a response.body


I want to use request-promise to pull the body of a page. Once I have the page, I want to collect all the img tags and get an array of their src values. Assume the src attributes on a page are a mix of relative and absolute paths. I want an array of absolute URLs for the images on the page. I know I can use some string manipulation and Node's path module to build the absolute path, but I would like to find a better way of doing it.

var rp = require('request-promise'),
    cheerio = require('cheerio');

var options = {
    uri: 'http://www.google.com',
    method: 'GET',
    resolveWithFullResponse: true
};

rp(options)
  .then (function (response) {
    $ = cheerio.load(response.body);
    var relativeLinks = $("img");
    relativeLinks.each( function() {
        var link = $(this).attr('src');
        console.log(link);
        if (link.startsWith('http')){
            console.log('abs');
        }
        else {
            console.log('rel');
        }
   });
});

Results:

  /logos/doodles/2016/phoebe-snetsingers-85th-birthday-5179281716019200-hp.gif
  rel
bsego asked Jun 09 '16 18:06



2 Answers

Store your page URL in a variable and use url.resolve to join the pieces together. In the Node REPL this works whether the src is relative or already absolute (hence the "resolving"):

$:~/Projects/test$ node
> var base = "https://www.google.com";
undefined
> var imageSrc = "/logos/doodles/2016/phoebe-snetsingers-85th-birthday-5179281716019200-hp.gif";
undefined
> var url = require('url');
undefined
> url.resolve(base, imageSrc);
'https://www.google.com/logos/doodles/2016/phoebe-snetsingers-85th-birthday-5179281716019200-hp.gif'
> imageSrc = base + imageSrc;
'https://www.google.com/logos/doodles/2016/phoebe-snetsingers-85th-birthday-5179281716019200-hp.gif'
> url.resolve(base, imageSrc);
'https://www.google.com/logos/doodles/2016/phoebe-snetsingers-85th-birthday-5179281716019200-hp.gif'
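The same call also covers other forms a src attribute can take. Here is a minimal sketch, assuming a made-up page URL and made-up src values (none of them come from the question), that runs one src of each common form through url.resolve:

```javascript
const url = require('url');

// Hypothetical page URL, used only for illustration.
const pageUrl = 'http://www.example.com/gallery/index.html';

// Hypothetical src values, one for each common form.
const srcs = [
  'thumb.png',                       // relative to the page's directory
  '../img/logo.png',                 // relative to the parent directory
  '/static/banner.gif',              // root-relative
  '//cdn.example.com/pic.jpg',       // protocol-relative
  'https://other.example.org/a.png'  // already absolute
];

// Resolve every src against the page URL.
const resolved = srcs.map(function (src) {
  return url.resolve(pageUrl, src);
});

console.log(resolved);
```

Note that a src like thumb.png is resolved against the directory of the page URL, so the base should be the full URL of the page you fetched, not just the site root.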

Your code would change to something like:

var rp = require('request-promise'),
    cheerio = require('cheerio'),
    url = require('url'),
    base = 'http://www.google.com';

var options = {
    uri: base,
    method: 'GET',
    resolveWithFullResponse: true
};

rp(options)
  .then(function (response) {
    var $ = cheerio.load(response.body); // declare $ locally instead of leaking a global
    var relativeLinks = $("img");
    relativeLinks.each(function () {
        var link = $(this).attr('src');
        if (!link) {
            return; // skip img tags that have no src attribute
        }
        var fullImagePath = url.resolve(base, link); // absolute for relative and absolute src values alike
        console.log(fullImagePath);
        if (link.startsWith('http')) {
            console.log('abs');
        }
        else {
            console.log('rel');
        }
    });
  });
Michael answered Sep 28 '22 03:09


To get an array of image links in your scenario, you can use url.resolve to resolve the src attribute of each img tag against the request URL, which gives you an absolute URL. The array is passed to the final then, where you can do something other than console.log with it if you prefer.

var rp = require('request-promise'),
    cheerio = require('cheerio'),
    url = require('url'),
    base = 'http://www.google.com';

var options = {
    uri: base,
    method: 'GET',
    resolveWithFullResponse: true
};

rp(options)
    .then(function (response) {
        var $ = cheerio.load(response.body);

        return $('img').map(function () {
            var src = $(this).attr('src');
            // returning null leaves img tags without a src out of the result
            return src ? url.resolve(base, src) : null;
        }).toArray();
    })
    .then(console.log);

url.resolve works for both absolute and relative URLs. When the src is relative, it returns the request URL combined with that path; when the src is already an absolute URL, it returns that URL unchanged. For example, if the page at google has img tags with /logos/cat.gif and https://test.com/dog.gif as their src attributes, this would output:

[ 
    'http://www.google.com/logos/cat.gif',
    'https://test.com/dog.gif'
]
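Newer Node documentation marks url.resolve as a legacy API and points to the WHATWG URL class instead. The following is only a sketch of the same resolution done with new URL(src, base), not part of the original answer. It assumes a Node version where URL is a global (Node 10 or later), and toAbsoluteUrls is a hypothetical helper name:

```javascript
// Hypothetical helper: turn a list of src values into absolute URLs.
// Assumes URL is available as a global (Node 10 or later).
function toAbsoluteUrls(base, srcs) {
  return srcs
    // drop missing or empty src values
    .filter(function (src) { return typeof src === 'string' && src.length > 0; })
    // new URL(src, base) resolves relative values against base
    // and keeps absolute values as they are
    .map(function (src) { return new URL(src, base).href; });
}

const absolute = toAbsoluteUrls('http://www.google.com', [
  '/logos/cat.gif',
  'https://test.com/dog.gif',
  undefined
]);

console.log(absolute);
```

With cheerio, the srcs array could come from $('img').map(...).toArray() as in the code above.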
Nick Bartlett answered Sep 28 '22 04:09