How to convert HTML page to plain text in node.js?

I know this has been asked before, but I can't find a good answer for node.js.

I need to extract the plain text (no tags, scripts, etc.) server-side from an HTML page that I fetch.

I know how to do it client-side with jQuery (get the .text() contents of the body tag), but do not know how to do this on the server side.

I've tried https://npmjs.org/package/html-to-text but this doesn't handle scripts.

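    var htmlToText = require('html-to-text');
    var request = require('request');

    request.get(url, function (error, result) {
        if (error) {
            console.error(error);
            return;
        }
        // Convert the fetched HTML body to plain text, wrapping at 130 columns
        var text = htmlToText.fromString(result.body, {
            wordwrap: 130
        });
        console.log(text);
    });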

I've tried phantom.js but can't find a way to just get plain text.

metalaureate asked Nov 14 '13 18:11


3 Answers

Use jsdom and jQuery (server-side).

With jQuery you can delete all scripts, styles, templates and the like and then you can extract the text.

Example

(This is not tested with jsdom and node, only in Chrome)

jQuery('script').remove();
jQuery('style').remove();
jQuery('noscript').remove();
jQuery('body').text().replace(/\s{2,}/g, ' ');
hgoebl answered Oct 15 '22 11:10


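As another answer suggested, use jsdom, but you don't need jQuery. Try this:

const { JSDOM } = require('jsdom');
JSDOM.fragment(sourceHtml).textContent

Note that textContent also includes the text inside script and style elements, so remove those nodes first if they must be excluded.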
Brad answered Oct 15 '22 12:10


For those searching for a regex solution, here is mine:

const HTMLPartToTextPart = (HTMLPart) => (
  HTMLPart
    .replace(/\n/ig, '')
    .replace(/<style[^>]*>[\s\S]*?<\/style[^>]*>/ig, '')
    .replace(/<head[^>]*>[\s\S]*?<\/head[^>]*>/ig, '')
    .replace(/<script[^>]*>[\s\S]*?<\/script[^>]*>/ig, '')
    .replace(/<\/\s*(?:p|div)>/ig, '\n')
    .replace(/<br[^>]*\/?>/ig, '\n')
    .replace(/<[^>]*>/ig, '')
    // A string pattern only replaces the first match, so use a global regex
    .replace(/&nbsp;/ig, ' ')
    .replace(/[^\S\r\n][^\S\r\n]+/ig, ' ')
);
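For anyone who wants to try the regex approach without extra dependencies, here is a minimal sketch that also decodes a few common entities. It is not part of the answer above: the names htmlToPlainText, decodeEntities and NAMED_ENTITIES are made up for illustration, and the entity table is deliberately tiny. Regexes cannot parse HTML reliably (a `>` inside an attribute value or a `</script>` inside a string literal will trip it up), so treat this as a rough cleanup for simple pages, not a replacement for a real parser such as jsdom.

```javascript
// Hypothetical helper names; a small, incomplete entity table for illustration.
const NAMED_ENTITIES = { nbsp: ' ', amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

// Decode numeric entities and the few named ones listed above.
// Unknown or out-of-range entities are left untouched.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

function htmlToPlainText(html) {
  const stripped = html
    // Drop comments and elements whose contents are never visible text
    .replace(/<!--[\s\S]*?-->/g, '')
    .replace(/<(script|style|noscript|head)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '')
    // Source whitespace (including newlines) is not significant in most HTML
    .replace(/\s+/g, ' ')
    // Turn line breaks and a few closing block tags into newlines
    .replace(/<br\b[^>]*>/gi, '\n')
    .replace(/<\/(p|div|li|h[1-6])\s*>/gi, '\n')
    // Remove every remaining tag
    .replace(/<[^>]+>/g, '');

  // Decode entities only after tags are gone, so a decoded "<" is never
  // mistaken for markup, then tidy up the whitespace.
  return decodeEntities(stripped)
    .replace(/[^\S\n]+/g, ' ')
    .replace(/ ?\n ?/g, '\n')
    .replace(/\n{2,}/g, '\n')
    .trim();
}

const sample = `<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body>
  <script>var hidden = "<p>not text</p>";</script>
  <p>Fish &amp; chips&nbsp;&nbsp;cost &lt; &#163;10</p>
  <div>Second<br>line</div>
</body></html>`;

console.log(htmlToPlainText(sample));
// Fish & chips cost < £10
// Second
// line
```

Decoding entities last is a deliberate choice here: doing it before the tag-stripping step would turn `&lt;` into a literal `<` that the tag regex could then swallow along with the text after it.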
Poyoman answered Oct 15 '22 10:10