Most questions about extracting text from HTML (i.e., stripping the tags) use:
jQuery( htmlString ).text();
While this abstracts browser inconsistencies (such as innerText vs. textContent), the function call also ignores the semantic meaning of block-level elements (such as li).
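To make the problem concrete, here is a rough stand-in for tag stripping in plain JavaScript. It is not jQuery's implementation (the stripTags helper is hypothetical and ignores entities, comments, and scripts), but for simple markup it loses block boundaries in the same way:

```javascript
// Hypothetical helper: removes anything that looks like a tag.
// A naive sketch for illustration only; not how jQuery's .text() works internally.
function stripTags( html ) {
  return html.replace( /<[^>]*>/g, '' );
}

var flat = stripTags( '<ul><li>First</li><li>Second</li></ul>' );

// The list items run together because nothing marks where each <li> ended.
console.log( flat ); // FirstSecond
```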
Preserving newlines of block-level elements (i.e., the semantic intent) across various browsers entails no small effort, as Mike Wilcox describes.
A seemingly simpler solution would be to emulate pasting HTML content into a <textarea>, which strips HTML while preserving block-level element newlines. However, JavaScript-based inserts do not trigger the same HTML-to-text routines that browsers employ when users paste content into a <textarea>.
I also tried integrating Mike Wilcox's JavaScript code. The code works in Chromium, but not in Firefox.
What is the simplest cross-browser way to extract text from HTML while preserving semantic newlines for block-level elements using jQuery (or vanilla JavaScript)?
Consider:
The textarea preserves the newlines for ordered lists, headings, preformatted text, and so forth. That is the result I would like to achieve.
To further clarify, given any HTML content, such as:
<h1>Header</h1>
<p>Paragraph</p>
<ul>
  <li>First</li>
  <li>Second</li>
</ul>
<dl>
  <dt>Term</dt>
  <dd>Definition</dd>
</dl>
<div>Div with <span>span</span>.<br />After the <a href="...">break</a>.</div>
How would you produce:
Header
Paragraph
First
Second
Term
Definition
Div with span.
After the break.
Note: Neither indentation nor non-normalized whitespace is relevant.
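Since whitespace differences do not matter, one way to compare a candidate result against the expected output is to normalize both first. A minimal sketch, assuming a hypothetical normalizeText helper (not part of jQuery or the DOM):

```javascript
// Hypothetical helper: canonicalizes extracted text so that two results can
// be compared without regard to indentation, repeated spaces, or blank lines.
function normalizeText( text ) {
  return text
    .split( '\n' )
    .map( function( line ) {
      // Collapse runs of spaces and tabs, then trim both ends of the line.
      return line.replace( /[ \t]+/g, ' ' ).trim();
    } )
    .filter( function( line ) {
      // Drop lines that are empty after trimming.
      return line.length > 0;
    } )
    .join( '\n' );
}

var sample = '  Header\n\n Paragraph \n   First\n\tSecond\n\n';
var normalized = normalizeText( sample );
console.log( normalized );
```

Two extraction results can then be treated as equivalent when their normalized forms are identical strings.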
Consider:
// Assumes this code lives inside an enclosing object or constructor,
// as the use of "this." implies.
var _this = this;

/**
 * Returns the style for a node.
 *
 * @param n The node to check.
 * @param p The property to retrieve (usually 'display').
 * @link http://www.quirksmode.org/dom/getstyles.html
 */
this.getStyle = function( n, p ) {
  return n.currentStyle ?
    n.currentStyle[p] :
    document.defaultView.getComputedStyle(n, null).getPropertyValue(p);
}

/**
 * Converts HTML to text, preserving semantic newlines for block-level
 * elements.
 *
 * @param node - The HTML node to perform text extraction.
 */
this.toText = function( node ) {
  var result = '';

  if( node.nodeType == document.TEXT_NODE ) {
    // Replace repeated spaces, newlines, and tabs with a single space.
    result = node.nodeValue.replace( /\s+/g, ' ' );
  }
  else {
    for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
      result += _this.toText( node.childNodes[i] );
    }

    // Only elements have a computed style; skip comments and the like.
    if( node.nodeType == document.ELEMENT_NODE ) {
      var d = _this.getStyle( node, 'display' );

      if( d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
          node.tagName == 'BR' || node.tagName == 'HR' ) {
        result += '\n';
      }
    }
  }

  return result;
}
http://jsfiddle.net/3mzrV/2/
That is to say, with an exception or two, iterate through each node and print its contents, letting the browser's computed style tell you when to insert newlines.
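Computed styles are only meaningful for nodes attached to a rendered document, so where they are unavailable (detached nodes, or code running outside a browser) the same traversal can fall back on tag names. The following is a sketch under that assumption, not the code from the fiddle: the BREAKING_TAGS set is hand-picked and incomplete, toTextByTag, el, and txt are hypothetical names, and CSS that changes an element's display is ignored.

```javascript
// Assumption: a hand-picked set of tags that end a line. It stands in for
// the computed 'display' lookup and will not reflect CSS overrides.
var BREAKING_TAGS = {
  ADDRESS: 1, BLOCKQUOTE: 1, BR: 1, DD: 1, DIV: 1, DL: 1, DT: 1,
  H1: 1, H2: 1, H3: 1, H4: 1, H5: 1, H6: 1, HR: 1, LI: 1, OL: 1,
  P: 1, PRE: 1, TABLE: 1, TR: 1, UL: 1
};

var TEXT_NODE = 3; // Same numeric value as Node.TEXT_NODE in the DOM.

// Walks a node (or node-like object) depth first, emitting text and adding
// a newline after every element whose tag is listed in BREAKING_TAGS.
function toTextByTag( node ) {
  if( node.nodeType === TEXT_NODE ) {
    // Replace repeated spaces, newlines, and tabs with a single space.
    return node.nodeValue.replace( /\s+/g, ' ' );
  }

  var result = '';
  var children = node.childNodes || [];

  for( var i = 0; i < children.length; i++ ) {
    result += toTextByTag( children[i] );
  }

  if( node.tagName &&
      BREAKING_TAGS.hasOwnProperty( node.tagName.toUpperCase() ) ) {
    result += '\n';
  }

  return result;
}

// Plain objects that mimic the few DOM properties the walker reads, so the
// sketch can be exercised without a browser.
function el( tagName, children ) {
  return { nodeType: 1, tagName: tagName, childNodes: children };
}

function txt( value ) {
  return { nodeType: TEXT_NODE, nodeValue: value };
}

var tree = el( 'DIV', [
  el( 'H1', [ txt( 'Header' ) ] ),
  el( 'UL', [ el( 'LI', [ txt( 'First' ) ] ), el( 'LI', [ txt( 'Second' ) ] ) ] )
] );

var extracted = toTextByTag( tree );
console.log( JSON.stringify( extracted ) );
```

Nested block elements each contribute their own newline, so the raw result contains consecutive newlines; since blank lines are not relevant here, they can be collapsed afterwards.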