
Extract text from HTML while preserving block-level element newlines

Background

Most questions about extracting text from HTML (i.e., stripping the tags) use:

jQuery( htmlString ).text(); 

While this abstracts browser inconsistencies (such as innerText vs. textContent), the function call also ignores the semantic meaning of block-level elements (such as li).

Problem

Preserving newlines of block-level elements (i.e., the semantic intent) across various browsers entails no small effort, as Mike Wilcox describes.

A seemingly simpler solution would be to emulate pasting HTML content into a <textarea>, which strips HTML while preserving block-level element newlines. However, JavaScript-based inserts do not trigger the same HTML-to-text routines that browsers employ when users paste content into a <textarea>.

I also tried integrating Mike Wilcox's JavaScript code. The code works in Chromium, but not in Firefox.

Question

What is the simplest cross-browser way to extract text from HTML while preserving semantic newlines for block-level elements using jQuery (or vanilla JavaScript)?

Example

Consider:

  1. Select and copy this entire question.
  2. Open the textarea example page.
  3. Paste the content into the textarea.

The textarea preserves the newlines for ordered lists, headings, preformatted text, and so forth. That is the result I would like to achieve.

To further clarify, given any HTML content, such as:

   <h1>Header</h1>
   <p>Paragraph</p>
   <ul>
     <li>First</li>
     <li>Second</li>
   </ul>
   <dl>
     <dt>Term</dt>
       <dd>Definition</dd>
   </dl>
   <div>Div with <span>span</span>.<br />After the <a href="...">break</a>.</div>

How would you produce:

   Header
   Paragraph

     First
     Second

   Term
     Definition

   Div with span.
   After the break.

Note: Neither indentation nor non-normalized whitespace is relevant.
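Since only the line structure matters, one way to compare a candidate result against the expected output is to canonicalise both first. The helper below is a minimal sketch, not part of the original question: the name normalizeLines is hypothetical, and it assumes that blank lines count as irrelevant whitespace too.

```javascript
// Hypothetical helper: reduce extracted text to its line structure so that
// indentation, repeated spaces, and blank lines do not affect a comparison.
function normalizeLines(text) {
  return text
    .split(/\r?\n/)                          // one entry per line
    .map(function (line) {
      // Collapse runs of whitespace and strip leading/trailing space.
      return line.replace(/\s+/g, ' ').trim();
    })
    .filter(function (line) {
      return line.length > 0;                // assumption: blank lines are irrelevant
    })
    .join('\n');
}

var expected = 'Header\nParagraph\n\n  First\n  Second';
var actual = 'Header \n Paragraph\n\tFirst\n\n\nSecond\n';

console.log(normalizeLines(expected) === normalizeLines(actual)); // true
```

Two results that differ only in indentation or spacing then compare as equal, while a missing or extra line break still shows up as a difference.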

asked Dec 04 '13 02:12 by Dave Jarvis


1 Answer

Consider:

// The original snippet uses `_this` without defining it. Capturing the
// enclosing object here (an assumption) lets the recursive calls resolve.
var _this = this;

/**
 * Returns the style for a node.
 *
 * @param n The node to check.
 * @param p The property to retrieve (usually 'display').
 * @link http://www.quirksmode.org/dom/getstyles.html
 */
this.getStyle = function( n, p ) {
  // Old Internet Explorer exposes currentStyle; other browsers use
  // getComputedStyle.
  return n.currentStyle ?
    n.currentStyle[p] :
    document.defaultView.getComputedStyle(n, null).getPropertyValue(p);
}

/**
 * Converts HTML to text, preserving semantic newlines for block-level
 * elements.
 *
 * @param node - The HTML node to perform text extraction.
 */
this.toText = function( node ) {
  var result = '';

  if( node.nodeType == document.TEXT_NODE ) {
    // Replace repeated spaces, newlines, and tabs with a single space.
    result = node.nodeValue.replace( /\s+/g, ' ' );
  }
  else {
    // Collect the text of all descendants first.
    for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
      result += _this.toText( node.childNodes[i] );
    }

    var d = _this.getStyle( node, 'display' );

    // End the line after block-like elements and explicit breaks.
    if( d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
        node.tagName == 'BR' || node.tagName == 'HR' ) {
      result += '\n';
    }
  }

  return result;
}

http://jsfiddle.net/3mzrV/2/

That is to say, with an exception or two, iterate through each node and print its contents, letting the browser's computed style tell you when to insert newlines.
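The same traversal can be sketched without a browser, which makes the newline rule easier to see in isolation. This is only an illustration under stated assumptions, not the code from the answer: a hard-coded tag list (BLOCK_TAGS) stands in for the computed display value, and the plain-object node shape and the name toTextFromTree are hypothetical.

```javascript
// Assumption: this fixed list replaces the browser's computed `display`
// lookup, so it ignores any CSS that changes how an element is displayed.
var BLOCK_TAGS = ['DIV', 'P', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'UL', 'OL',
                  'LI', 'DL', 'DT', 'DD', 'PRE', 'BLOCKQUOTE', 'TR', 'BR', 'HR'];

// Hypothetical node shape: { text: '...' } for text nodes and
// { tag: 'P', children: [...] } for elements.
function toTextFromTree(node) {
  if (typeof node.text === 'string') {
    // Replace repeated spaces, newlines, and tabs with a single space.
    return node.text.replace(/\s+/g, ' ');
  }

  var result = '';
  var children = node.children || [];

  // Depth-first: emit the text of every descendant in document order.
  for (var i = 0; i < children.length; i++) {
    result += toTextFromTree(children[i]);
  }

  // Append a newline after block-like elements and explicit breaks.
  if (BLOCK_TAGS.indexOf(node.tag) !== -1) {
    result += '\n';
  }

  return result;
}

var tree = { tag: 'DIV', children: [
  { text: 'Div with ' },
  { tag: 'SPAN', children: [ { text: 'span' } ] },
  { text: '.' },
  { tag: 'BR' },
  { text: 'After the ' },
  { tag: 'A', children: [ { text: 'break' } ] },
  { text: '.' }
] };

console.log(JSON.stringify(toTextFromTree(tree)));
// "Div with span.\nAfter the break.\n"
```

Inline elements such as span and a contribute only their text, while br and the enclosing div each end a line. In a real page the computed style remains the better signal, because a stylesheet can make any element block-level or inline.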

answered Oct 07 '22 17:10 by svidgen