How to convert HTML page to plain text in node.js?

Tags:

javascript

node.js

screen-scraping

I know this has been asked before, but I can't find a good answer for Node.js.

I need to extract, on the server side, the plain text (no tags, scripts, etc.) from an HTML page that has been fetched.

I know how to do this client-side with jQuery (get the .text() contents of the body tag), but I don't know how to do it on the server side.

I've tried https://npmjs.org/package/html-to-text, but it doesn't handle scripts.

    var htmlToText = require('html-to-text');
    var request = require('request');

    // url holds the address of the page to fetch
    request.get(url, function (error, result) {
        if (error) {
            return console.error(error);
        }
        var text = htmlToText.fromString(result.body, {
            wordwrap: 130
        });
    });

I've tried PhantomJS, but I can't find a way to get just the plain text.

asked Nov 14 '13 18:11

metalaureate

3 Answers

Use jsdom and jQuery (server-side).

With jQuery you can remove all scripts, styles, templates and the like, and then extract the remaining text.

Example (tested only in Chrome, not with jsdom and Node):

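// Remove elements whose contents are not readable text
jQuery('script').remove()
jQuery('noscript').remove()
jQuery('style').remove()
// Collapse runs of whitespace into a single space
jQuery('body').text().replace(/\s{2,}/g, ' ')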

answered Oct 15 '22 11:10

hgoebl

As another answer suggested, use JSDOM, but you don't need jQuery. Try this:

const { JSDOM } = require('jsdom');
JSDOM.fragment(sourceHtml).textContent

answered Oct 15 '22 12:10

Brad

For those searching for a regex solution, here is mine:

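const HTMLPartToTextPart = (HTMLPart) => (
  HTMLPart
    // Drop source line breaks; block-level tags add them back below
    .replace(/\n/g, '')
    // Remove style, head, and script elements together with their contents
    .replace(/<style[^>]*>[\s\S]*?<\/style[^>]*>/ig, '')
    .replace(/<head[^>]*>[\s\S]*?<\/head[^>]*>/ig, '')
    .replace(/<script[^>]*>[\s\S]*?<\/script[^>]*>/ig, '')
    // Turn paragraph/div ends and <br> into line breaks
    .replace(/<\/\s*(?:p|div)>/ig, '\n')
    .replace(/<br[^>]*\/?>/ig, '\n')
    // Strip all remaining tags
    .replace(/<[^>]*>/ig, '')
    // Replace every &nbsp; entity, not only the first one
    .replace(/&nbsp;/g, ' ')
    // Collapse runs of spaces and tabs, keeping line breaks
    .replace(/[^\S\r\n][^\S\r\n]+/g, ' ')
);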
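
The function above handles only the &nbsp; entity, so text such as &amp;amp; or &amp;#169; stays encoded in the result. Below is a minimal sketch of a decoding step that could be applied afterwards, assuming only a handful of named entities matter. decodeBasicEntities and its entity table are hypothetical and are not part of the answer above.

```javascript
// Hypothetical helper: decodes a few common named entities plus
// decimal and hexadecimal character references.
const NAMED_ENTITIES = new Map([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
  ['nbsp', ' '],
]);

const decodeBasicEntities = (text) =>
  text.replace(/&(#[xX][0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Unknown names are left as they are.
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });

console.log(decodeBasicEntities('Fish &amp; chips &lt;3 &#169; &#x263A; &unknown;'));
```

Decoding belongs after the tags have been stripped. If it runs first, an encoded &amp;lt;b&amp;gt; would turn into a literal tag and then be removed along with real markup.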

answered Oct 15 '22 10:10

Poyoman
