
Node.js: grab the first image in an HTML string

I'm trying to grab the first image in an HTML string like this one:

  <table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;"><tr><td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><a href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;ct2=us&amp;usg=AFQjCNFfn6RXQ3v898sGY_-sFLGCJ4EV5Q&amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;cid=52778551504048&amp;ei=zfK5U7D4JoLi1Ab0wIHwDw&amp;url=http://online.wsj.com/articles/obamas-letters-to-corinthian-1404684555"><img src="//t3.gstatic.com/images?q=tbn:ANd9GcQVyQsQJvKMgXHEX9riJuZKWav5U1nI-jdB-i1HwFYQ-7jGvGrbk9N_k0XEDMVH-HAbLxP1wrU" alt="" border="1" width="80" height="80" /><br /><font size="-2">Wall Street Journal</font></a></font></td><td valign="top" class="j"><font style="font-size:85%;font-family:arial,sans-serif"><br /><div style="padding-top:0.8em;"><img alt="" height="1" width="1" /></div><div class="lh"><a href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;ct2=us&amp;usg=AFQjCNFfn6RXQ3v898sGY_-sFLGCJ4EV5Q&amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;cid=52778551504048&amp;ei=zfK5U7D4JoLi1Ab0wIHwDw&amp;url=http://online.wsj.com/articles/obamas-letters-to-corinthian-1404684555"><b><b>Obama&#39;s</b> Letters to Corinthian</b></a><br /><font size="-1"><b><font color="#6f6f6f">Wall Street Journal</font></b></font><br /><font size="-1">The <b>Obama</b> Administration has targeted for-profit colleges as if they are enemy combatants. And now it has succeeded in putting out of business Santa Ana-based Corinthian Colleges for a dilatory response to document requests. Does the White House plan&nbsp;...</font><br /><font size="-1" class="p"></font><br /><font class="p" size="-1"><a class="p" href="http://news.google.com/news/more?ncl=dPkBozywrsIXKoM&amp;authuser=0&amp;ned=us"><nobr><b>and more&nbsp;&raquo;</b></nobr></a></font></div></font></td></tr></table>

Here is the tag of the image:

<img src="//t3.gstatic.com/images?q=tbn:ANd9GcQVyQsQJvKMgXHEX9riJuZKWav5U1nI-jdB-i1HwFYQ-7jGvGrbk9N_k0XEDMVH-HAbLxP1wrU" alt="" border="1" width="80" height="80">

Every image has a URL of the form //tx.gstatic.com, where x is a number (I think between 0 and 3).

This is what I tried, without success, and I don't understand why it fails:

      var re = /<img[^>]+src="?([^"\s]+)"?\s*\/>/g;
      var results = re.exec(HTMLSTRING);
      var img="";
      if(results!=null && results.length!=0) img = results[0];
Usi Usi asked Feb 12 '23 13:02



1 Answer

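The regular expression you provided is indeed not general enough to match your <img> tag: after the src value it only allows whitespace before the closing />, but your tag has alt, border, width and height attributes following src.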
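To see the failure concretely, here is a minimal sketch that runs the question's pattern against a shortened stand-in for the tag. The `tbn:abc` URL suffix and the variable names are made up for illustration; only the pattern itself comes from the question.

```javascript
// Shortened stand-in for the <img> tag from the question (hypothetical URL suffix).
const tag = '<img src="//t3.gstatic.com/images?q=tbn:abc" alt="" border="1" width="80" height="80" />';

// Same attributes, but with src placed last (hypothetical variant for comparison).
const srcLast = '<img alt="" src="//t3.gstatic.com/images?q=tbn:abc" />';

// The pattern from the question: only whitespace may sit between the src value and "/>".
const strict = /<img[^>]+src="?([^"\s]+)"?\s*\/>/;

const strictOnTag = strict.exec(tag);         // null: alt, border, ... follow src
const strictOnSrcLast = strict.exec(srcLast); // matches: nothing but a space follows src

console.log(strictOnTag);
console.log(strictOnSrcLast && strictOnSrcLast[1]);
```

So the pattern is not wrong in general; it just assumes src is the last attribute in the tag.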

There are two options:

  • Make a better regular expression. This way lies madness. But in this case, it is sufficient to add the possibility of other attributes after src:

    var re = /<img[^>]+src="?([^"\s]+)"?[^>]*\/>/g;
    var results = re.exec(HTMLSTRING);
    var img="";
    if(results) img = results[1];
    

    Note that [^>]* replaces your \s*, so any attributes may follow src. Also note results[1] instead of results[0]: index 0 is the whole matched tag, while index 1 is the captured src value.

  • Use a DOM parser to handle DOM. This is the easy path.

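    // jsdom 10+ API (the JSDOM class); older releases used the
    // callback-based jsdom.env() instead.
    var JSDOM = require("jsdom").JSDOM;
    var document = new JSDOM(HTMLSTRING).window.document;
    var imgs = document.getElementsByTagName('img');
    for (var i = 0; i < imgs.length; i++) {
      // Skip images without a src, such as the 1x1 spacer in your markup.
      var src = imgs[i].getAttribute('src');
      if (src) console.log(src);
    }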
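If you stay with regular expressions, a slightly more defensive version could look like the sketch below. It is only an illustration under a few assumptions: the helper names `firstImageSrc` and `absoluteUrl` are made up, the sample markup is a shortened, hypothetical variant of your string, HTML entities in the URL are not decoded, and prefixing `https:` to a protocol-relative URL assumes the image host serves HTTPS.

```javascript
// Return the src of the first <img> tag that has a non-empty src attribute, or null.
// Regex-based sketch: fine for small, predictable snippets, not for arbitrary HTML.
function firstImageSrc(html) {
  // Collect every <img ...> tag in document order.
  const tags = html.match(/<img\b[^>]*>/gi) || [];
  for (const tag of tags) {
    // Accept double-quoted, single-quoted, or unquoted src values,
    // wherever the attribute appears inside the tag.
    const m = /\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i.exec(tag);
    const src = m && (m[1] || m[2] || m[3]);
    if (src) return src;
  }
  return null;
}

// Turn a protocol-relative URL ("//host/path") into an absolute one.
// Assumption: https is acceptable for the image host.
function absoluteUrl(src) {
  return src.startsWith("//") ? "https:" + src : src;
}

// Hypothetical sample: a spacer image without src, then a real thumbnail.
const sample =
  '<td><div><img alt="" height="1" width="1" /></div>' +
  '<a href="#"><img src="//t3.gstatic.com/images?q=tbn:abc" alt="" width="80" /></a></td>';

const first = firstImageSrc(sample);
console.log(first && absoluteUrl(first));
```

Matching the tag first and the attribute second keeps each pattern short, and it skips images that have no src at all instead of giving up on them.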
    
Amadan answered Feb 15 '23 10:02
