
How to parse HTML/XML documents with Node.js?

I have an editor.html that contains a generatePNG function:

  <!DOCTYPE html> 
<html> 
<head> 
    <meta charset="UTF-8"> 
    <title>Diagram</title> 

    <script type="text/javascript" src="lib/jquery-1.8.1.js"></script> 
    <!-- I use many resources -->
<script></script> 

    <script> 

        function generatePNG (oViewer) { 
            var oImageOptions = { 
                includeDecoratorLayers: false, 
                replaceImageURL: true 
            }; 

            var d = new Date(); 
            var h = d.getHours(); 
            var m = d.getMinutes(); 
            var s = d.getSeconds(); 

            var sFileName = "diagram" + h.toString() + m.toString() + s.toString() + ".png"; 

            var sResultBlob = oViewer.generateImageBlob(function(sBlob) { 
                var reader = new window.FileReader(); 
                reader.readAsDataURL(sBlob); 
                reader.onloadend = function() { 
                    var base64data = reader.result; // the PNG as a data URL
                    var image = document.createElement('img'); 
                    image.setAttribute("id", "GraphImage"); 
                    image.src = base64data; 
                    document.body.appendChild(image); 
                } 

            }, "image/png", oImageOptions); 
            return sResultBlob; 
        } 

    </script> 


</head> 

<body > 
    <div id="diagramContainer"></div> 
</body> 
</html>

I want to access the DOM and get image.src using Node.js. I found that I can work with cheerio or jsdom.

I started with this:

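var fs = require('fs'),
    cheerio = require('cheerio'),
    // cheerio.load() expects markup, not a file path, so read the file first
    $ = cheerio.load(fs.readFileSync('editor.html', 'utf8'));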

But I can't figure out how to access the image and get image.src.

ameni asked Dec 16 '15 10:12



1 Answer

The problem is that loading an HTML file into cheerio (or a similar Node.js parsing module) does not process the HTML the way a browser does. Assets (such as stylesheets, images and scripts) are not loaded or executed as they would be in a browser, so generatePNG never runs and the image element is never created.

While Node.js and modern web browsers use the same (or similar) JavaScript engines, a browser adds a lot on top, such as window and the DOM (document). Node.js does not have these concepts, so there is neither window.FileReader nor document.createElement.
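
A quick way to see this is to check for the browser globals from a script. This is only a minimal sketch, and it assumes the script is run with plain Node.js, without jsdom or any other library that defines these globals:

```javascript
// Run with plain Node.js: the browser globals that generatePNG relies on
// are not defined in this environment.
var hasWindow = typeof window !== 'undefined';
var hasDocument = typeof document !== 'undefined';

console.log('window defined:', hasWindow);
console.log('document defined:', hasDocument);
```

Both checks report false under those assumptions, which is why the code from editor.html cannot simply be executed in Node.js as it is.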

If the image is created entirely without user interaction (your code sample 'magically' receives the sBlob argument, which appears to be a string like data:<type>;<encoding>,<data>), you could use a so-called headless browser on the server; PhantomJS seems most popular these days. Then again, if no user interaction is required to create the sBlob, you are probably better off with a pure Node.js solution, e.g. the one described in "How do I parse a data URL in Node?".
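
As a rough illustration of the pure Node.js route, a base64 data URL can be split into its MIME type and raw bytes with built-in modules only. The helper name parseDataUrl and the sample input are made up for this sketch, and the regular expression assumes the simple data:<type>;base64,<data> shape, so a data URL with extra parameters would need a more careful parser:

```javascript
// Split a base64 data URL ("data:<type>;base64,<data>") into its MIME type
// and the decoded bytes. Returns null if the string does not match.
// parseDataUrl is a hypothetical helper, not part of any library.
function parseDataUrl(dataUrl) {
    var match = /^data:([^;,]+);base64,(.*)$/.exec(dataUrl);
    if (!match) {
        return null;
    }
    return {
        mimeType: match[1],
        data: Buffer.from(match[2], 'base64')
    };
}

// Made-up sample input: the base64 text "aGVsbG8=" decodes to "hello".
// A PNG data URL would be handled the same way.
var parsed = parseDataUrl('data:text/plain;base64,aGVsbG8=');
console.log(parsed.mimeType, parsed.data.toString('utf8'));

// To store a decoded PNG on disk (the file name is a placeholder):
// require('fs').writeFileSync('diagram.png', parseDataUrl(pngDataUrl).data);
```

The decoded Buffer is the image file itself, so nothing from the DOM is needed once the data URL is available on the server.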

If some kind of user interaction is required to create the sBlob and you need to store it on a server, you can use pretty much the same solution as mentioned above: simply send the sBlob to the server using Ajax or a WebSocket, process it into an image there and (optionally) return the URL where the image can be found.
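
The server side of that idea could look roughly like the sketch below, which uses only Node's built-in http, fs and path modules. Everything specific in it is an assumption for illustration: the /diagram route, the createUploadServer and decodePngDataUrl helpers, the file naming and the /uploads/ URL prefix are not from the question, and a real service would also need a body size limit and something to serve the stored files.

```javascript
var http = require('http');
var fs = require('fs');
var path = require('path');

// Decode a base64 PNG data URL into a Buffer, or return null if it is not one.
function decodePngDataUrl(dataUrl) {
    var match = /^data:image\/png;base64,([A-Za-z0-9+\/=]+)$/.exec(dataUrl);
    return match ? Buffer.from(match[1], 'base64') : null;
}

// Build (but do not start) a server that accepts POST /diagram with the data
// URL as the request body and stores the decoded image in uploadDir.
function createUploadServer(uploadDir) {
    return http.createServer(function (req, res) {
        if (req.method !== 'POST' || req.url !== '/diagram') {
            res.writeHead(404);
            return res.end();
        }
        var body = '';
        req.setEncoding('utf8');
        req.on('data', function (chunk) { body += chunk; });
        req.on('end', function () {
            var png = decodePngDataUrl(body);
            if (!png) {
                res.writeHead(400);
                return res.end('expected a base64 PNG data URL');
            }
            var fileName = 'diagram-' + Date.now() + '.png';
            fs.writeFile(path.join(uploadDir, fileName), png, function (err) {
                if (err) {
                    res.writeHead(500);
                    return res.end('could not store image');
                }
                // Assumes the files are served under /uploads/ by some other means.
                res.writeHead(201, { 'Content-Type': 'application/json' });
                res.end(JSON.stringify({ url: '/uploads/' + fileName }));
            });
        });
    });
}
```

It could be started with createUploadServer('./uploads').listen(3000), provided the uploads directory already exists. In the browser, the onloadend handler from the question would then POST base64data to /diagram (for example with jQuery's $.ajax) instead of, or in addition to, appending the img element.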

Rogier Spieker answered Nov 15 '22 22:11
