I have a CouchDB view map function that generates an abstract of a stored HTML document (first x
characters of text). Unfortunately I have no browser environment to convert HTML to plain text.
Currently I use this multi-stage regexp:
    html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
        .replace(/<script([\s\S]*?)<\/script>/gi, ' ')
        .replace(/(<(?:.|\n)*?>)/gm, ' ')
        .replace(/\s+/gm, ' ');
While it's a very good filter, it's obviously not a perfect one, and some leftovers sometimes slip through. Is there a better way to convert to plain text without a browser environment?
If a DOM is available, a simple approach is to create a dummy element and keep it in a variable, assign the HTML text to its innerHTML, and then read the plain text back from the element's text content. Note that this needs a browser-like environment, so it does not work inside a CouchDB view.
Wrap this in a function, for example html2text, that takes the HTML text and returns the plain text.
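The answer above does not include the function body, so here is a minimal sketch of what such an html2text could look like. It assumes a browser-like global `document`. The body is an illustration, not the original author's code.

```javascript
// Sketch: convert HTML to plain text using a detached DOM element.
// Assumes a browser-like global `document`; not available in CouchDB.
function html2text(html) {
  var dummy = document.createElement('div'); // never attached to the page
  dummy.innerHTML = html;                    // let the HTML parser build the tree
  // textContent is the standard property; innerText is a fallback for old browsers
  return dummy.textContent || dummy.innerText || '';
}
```

Two caveats: textContent also returns the text inside style and script elements, and assigning untrusted markup to innerHTML can still fire inline event handlers such as onerror on an img, so this is best kept to trusted input.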
You can show HTML tags as plain text on a website or webpage by replacing < with &lt; or &#60; and > with &gt; or &#62; in each HTML tag that you want to be visible. Ordinarily, HTML tags are not visible to the reader in the browser.
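The replacement described above can be sketched as a small helper. The name `escapeHtml` is made up for this example, and escaping & as well is an addition here, needed so that existing entities in the input are shown literally instead of being decoded.

```javascript
// Sketch: make HTML markup visible as literal text by escaping it.
// `escapeHtml` is a hypothetical helper name.
function escapeHtml(source) {
  return String(source)
    .replace(/&/g, '&amp;') // must run first so the entities added below are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}
```

For example, passing `<b>a & b</b>` through this helper yields `&lt;b&gt;a &amp; b&lt;/b&gt;`, which a browser displays as the original markup.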
This simple regular expression works:
text.replace(/<[^>]*>/g, '');
It removes all tags, anchors included.
Entities such as &lt; do not contain a literal <, so they cause no problem for this regex.
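For the original no-browser case, the regex idea can be extended a little: drop comments and whole script/style elements first, strip the remaining tags, then decode entities so the abstract does not contain text like &amp;. The following is a rough sketch under those assumptions. Every name in it (NAMED_ENTITIES, decodeEntities, htmlToAbstract) is hypothetical, the entity table is deliberately tiny, and a regex will still miss edge cases such as a > inside an attribute value.

```javascript
// Sketch: regex-based HTML-to-text for environments without a DOM.
// All names below are hypothetical; the entity table covers only a few common cases.
var NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ' };

function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, function (match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1).toLowerCase() === 'x';
      var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Only BMP code points are decoded; anything else is left untouched.
      return code > 0 && code <= 0xffff ? String.fromCharCode(code) : match;
    }
    // Unknown named entities are left as they are.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

function htmlToAbstract(html, maxLength) {
  var text = html
    .replace(/<!--[\s\S]*?-->/g, ' ')                      // comments
    .replace(/<(script|style)\b[\s\S]*?<\/\1\s*>/gi, ' ')  // whole script/style elements
    .replace(/<[^>]*>/g, ' ');                             // remaining tags
  text = decodeEntities(text)
    .replace(/\s+/g, ' ')     // collapse runs of whitespace
    .replace(/^ | $/g, '');   // trim the single leading/trailing space left over
  return text.slice(0, maxLength);
}
```

Decoding happens after tag stripping on purpose: a decoded &lt; would otherwise look like the start of a tag. In a view map function this might be used as `emit(doc._id, htmlToAbstract(doc.html, 200))`, where the `html` field name is only an assumption about the stored document.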