
Cleaning HTML by removing extra/redundant formatting tags

Tags: html, dom, php, html-parsing, bbcode

I have been using the CKEditor WYSIWYG editor on a website where users can add comments through the HTML editor. I ended up with extremely redundant nested HTML in my database, which slows down viewing and editing of these comments.

I have comments that look like this (this is a very small example; I have comments with over 100 nested tags):

<p>
  <strong>
   <span style="font-size: 14px">
    <span style="color: #006400">
      <span style="font-size: 14px">
       <span style="font-size: 16px">
        <span style="color: #006400">
         <span style="font-size: 14px">
          <span style="font-size: 16px">
           <span style="color: #006400">This is a </span>
          </span>
         </span>
        </span>
       </span>
      </span>
     </span>
     <span style="color: #006400">
      <span style="font-size: 16px">
       <span style="color: #b22222">Test</span>
      </span>
     </span>
    </span>
   </span>
  </strong>
</p>

My questions are:

Is there any library, code, or software that can do a smart (i.e. format-aware) clean-up of the HTML, removing all redundant tags that have no effect on the formatting because they are overridden by inner tags? I've tried many existing online solutions (such as HTML Tidy). None of them do what I want.
If not, I'll need to write some code for HTML parsing and cleaning. I am planning to use PHP Simple HTML DOM to traverse the HTML tree and find all tags that have no effect. Do you suggest any other HTML parser that is more suitable for my purpose?

Thanks

Update:

I have written some code to analyze the HTML code that I have. All the HTML tags that I have are:

<span> with styles for font-size and/or color
<font> with attributes color and/or size
<a> for links (with href)

<p> (single tag to wrap the whole comment)

I can easily write some code to convert the HTML into bbcode (e.g. [b], [color=blue], [size=3], etc.). The above HTML would then become something like:

[b][size=14][color=#006400][size=14][size=16][color=#006400] [size=14][size=16][color=#006400]This is a [/color][/size] [/size][/color][/size][/size][color=#006400][size=16] [color=#b22222]Test[/color][/size][/color][/color][/size][/b]

The question now is: is there an easy way (algorithm, library, etc.) to clean up the bbcode that will be generated, which is just as messy as the original HTML?

thanks again


asked Apr 20 '12 14:04

Aziz

2 Answers

Introduction

The best solution I have seen so far is HTML Tidy: http://tidy.sourceforge.net/

Beyond converting the format of a document, Tidy is also able to convert deprecated HTML tags into their cascading style sheet (CSS) counterparts automatically through the use of the clean option. The generated output contains an inline style declaration.

It also ensures that the HTML document is XHTML-compatible.

Example

$code = '<p>
  <strong>
   <span style="font-size: 14px">
    <span style="color: #006400">
      <span style="font-size: 14px">
       <span style="font-size: 16px">
        <span style="color: #006400">
         <span style="font-size: 14px">
          <span style="font-size: 16px">
           <span style="color: #006400">This is a </span>
          </span>
         </span>
        </span>
       </span>
      </span>
     </span>
     <span style="color: #006400">
      <span style="font-size: 16px">
       <span style="color: #b22222">Test</span>
      </span>
     </span>
    </span>
   </span>
  </strong>
</p>';

If you run

$clean = cleaning($code);
print($clean['body']);

Output

<p>
    <strong>
        <span class="c3">
            <span class="c1">This is a</span>
            <span class="c2">Test</span>
        </span>
    </strong>
</p>

You can get the CSS

$clean = cleaning($code);
print($clean['style']);

Output

<style type="text/css">
    span.c3 {
        font-size: 14px
    }

    span.c2 {
        color: #006400;
        font-size: 16px
    }

    span.c1 {
        color: #006400;
        font-size: 14px
    }
</style>

Or the full HTML

$clean = cleaning($code);
print($clean['full']);

Output

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title></title>
    <style type="text/css">
/*<![CDATA[*/
    span.c3 {font-size: 14px}
    span.c2 {color: #006400; font-size: 16px}
    span.c1 {color: #006400; font-size: 14px}
    /*]]>*/
    </style>
  </head>
  <body>
    <p>
      <strong><span class="c3"><span class="c1">This is a</span>
      <span class="c2">Test</span></span></strong>
    </p>
  </body>
</html>

Function Used

function cleaning($string, $tidyConfig = null) {
    // Default Tidy options; 'clean' => true enables the clean-up
    // described in the introduction.
    $config = array (
            'indent' => true,
            'show-body-only' => false,
            'clean' => true,
            'output-xhtml' => true,
            'preserve-entities' => true
    );
    if ($tidyConfig === null) {
        $tidyConfig = $config;
    }

    $out = array ();
    $tidy = new tidy ();
    $out ['full'] = $tidy->repairString ( $string, $tidyConfig, 'UTF8' );

    // Everything between <body> and </body>.
    $out ['body'] = preg_replace ( "/.*<body[^>]*>|<\/body>.*/si", "", $out ['full'] );
    // Everything between <style> and </style>, wrapped in a fresh <style> element.
    $out ['style'] = '<style type="text/css">' . preg_replace ( "/.*<style[^>]*>|<\/style>.*/si", "", $out ['full'] ) . '</style>';
    return $out;
}

================================================

Edit 1 : Dirty Hack (Not Recommended)

================================================

Based on your last comment, it seems you want to retain the deprecated inline styles. HTML Tidy might not allow you to do that, since they are deprecated, but you can do this:

$out = cleaning ( $code );
$getStyle = new css2string ();
$getStyle->parseStr ( $out ['style'] );
$body = $out ['body'];
$search = array ();
$replace = array ();

foreach ( $getStyle->css as $key => $value ) {
    // $key looks like "span.c1": split it into tag name and class name.
    list ( $selector, $name ) = explode ( ".", $key );
    $search [] = "<$selector class=\"$name\">";
    $style = array ();
    foreach ( $value as $type => $att ) {
        $style [] = "$type:$att";
    }
    $replace [] = "<$selector style=\"" . implode ( ";", $style ) . ";\">";
}

// Swap each class-based tag for its inline-style equivalent and print the result.
$body = str_replace ( $search, $replace, $body );
print ($body);

Output

<p>
    <strong>
        <span style="font-size:14px;">
            <span style="color:#006400;font-size:14px;">This is a</span>
            <span style="color:#006400;font-size:16px;">Test</span>
        </span>
    </strong>
</p>

Class Used

// Credit: http://stackoverflow.com/a/8511837/1226894
class css2string {
    public $css;

    // Parse a stylesheet string into $this->css [selector] [property] = value.
    function parseStr($string) {
        preg_match_all ( '/(?ims)([a-z0-9, \s\.\:#_\-@]+)\{([^\}]*)\}/', $string, $arr );
        $this->css = array ();
        foreach ( $arr [0] as $i => $x ) {
            $selector = trim ( $arr [1] [$i] );
            $rules = explode ( ';', trim ( $arr [2] [$i] ) );
            $this->css [$selector] = array ();
            foreach ( $rules as $strRule ) {
                // Skip blank fragments such as the one after a trailing ";".
                if (trim ( $strRule ) === '') {
                    continue;
                }
                // Split into at most 2 parts so values containing ":" stay intact.
                $rule = explode ( ":", $strRule, 2 );
                if (count ( $rule ) === 2) {
                    $this->css [$selector] [trim ( $rule [0] )] = trim ( $rule [1] );
                }
            }
        }
    }

    // Join an associative array as key{glue}value pairs separated by $separator.
    function arrayImplode($glue, $separator, $array) {
        if (! is_array ( $array ))
            return $array;
        $styleString = array ();
        foreach ( $array as $key => $val ) {
            if (is_array ( $val ))
                $val = implode ( ',', $val );
            $styleString [] = "{$key}{$glue}{$val}";
        }
        return implode ( $separator, $styleString );
    }

    // Return the declarations of one selector as "property:value;property:value".
    function getSelector($selectorName) {
        return $this->arrayImplode ( ":", ";", $this->css [$selectorName] );
    }
}
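The search/replace pairs built in the hack above only match tags whose sole attribute is the class, written exactly as Tidy emitted it. A slightly less fragile variant is to let PHP's DOM extension rewrite the attributes. The following is only a rough sketch: inlineClassStyles() is a hypothetical helper (it is not part of Tidy or css2string), and it assumes the rules array has the same selector => property => value shape that css2string::parseStr() leaves in $css, with simple tag.class selectors.

```php
<?php
// Hypothetical helper: replace class="cN" with the matching inline style.
// $rules is assumed to look like ['span.c1' => ['color' => '#006400', ...], ...].
function inlineClassStyles(string $bodyHtml, array $rules): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    // The leading processing instruction is a common way to hint UTF-8 to libxml.
    $dom->loadHTML('<?xml encoding="UTF-8"><div id="wrap">' . $bodyHtml . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*[@class]') as $el) {
        $key = $el->nodeName . '.' . $el->getAttribute('class');
        if (!isset($rules[$key])) {
            continue; // not one of the generated classes, leave it alone
        }
        $decls = [];
        foreach ($rules[$key] as $prop => $val) {
            $decls[] = "$prop:$val";
        }
        $el->setAttribute('style', implode(';', $decls) . ';');
        $el->removeAttribute('class');
    }

    // Serialize only the children of the wrapper, i.e. the original fragment.
    $wrap = $xpath->query('//div[@id="wrap"]')->item(0);
    if ($wrap === null) {
        return $bodyHtml;
    }
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Usage with rules shaped like the parsed Tidy stylesheet shown earlier.
$rules = [
    'span.c3' => ['font-size' => '14px'],
    'span.c2' => ['color' => '#006400', 'font-size' => '16px'],
    'span.c1' => ['color' => '#006400', 'font-size' => '14px'],
];
$body = '<p><strong><span class="c3"><span class="c1">This is a</span> '
      . '<span class="c2">Test</span></span></strong></p>';
echo inlineClassStyles($body, $rules), "\n";
```

With css2string you would pass $getStyle->css as the second argument. Elements whose class has no matching rule are left untouched.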

answered Sep 17 '22 21:09

Baba

Here is a solution that uses the browser to get the nested elements' properties. There is no need to cascade the properties up yourself, since the computed CSS styles can be read directly from the browser.

Here is an example: http://jsfiddle.net/mmeah/fUpe8/3/

var fixedCode = readNestProp($("#redo"));
$("#simp").html( fixedCode );

function readNestProp(el){
    var output = "";
    $(el).children().each( function(){
        if($(this).children().length==0){
            var _that=this;
            var _cssAttributeNames = ["font-size","color"];
            var _tag = $(_that).prop("nodeName").toLowerCase();
            var _text = $(_that).text();
            var _style = "";
            $.each(_cssAttributeNames, function(_index,_value){
                var css_value = $(_that).css(_value);
                if(typeof css_value!= "undefined"){
                    _style += _value + ":";
                    _style += css_value + ";";
                }
            });
            output += "<"+_tag+" style='"+_style+"'>"+_text+"</"+_tag+">";
        }else if(
            $(this).prop("nodeName").toLowerCase() !=
            $(this).find(">:first-child").prop("nodeName").toLowerCase()
        ){
            var _tag = $(this).prop("nodeName").toLowerCase();
            output += "<"+_tag+">" + readNestProp(this) + "</"+_tag+">";
        }else{
            output += readNestProp(this);
        }
    });
    return output;
}

Rather than typing in every CSS attribute you care about, as in
var _cssAttributeNames = ["font-size","color"];
a better approach is to use a solution like the one mentioned here: Can jQuery get all CSS styles associated with an element?
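If the clean-up has to run on the server without a browser, the same idea can be approximated in PHP by walking the tree and carrying the inherited style down by hand, so that the innermost declaration wins. This is a minimal sketch under loud assumptions: parseInlineStyle(), collectRuns() and flattenSpans() are hypothetical names, only span styles and strong/b are treated as formatting, and every other element (including links) is reduced to its text, so it would need extending before use on real comments.

```php
<?php
// Hypothetical helpers for illustration only; none of these names come from a library.

// "font-size: 14px; color: #006400" -> ['font-size' => '14px', 'color' => '#006400']
function parseInlineStyle(string $css): array
{
    $props = [];
    foreach (explode(';', $css) as $decl) {
        $parts = explode(':', $decl, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $props[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $props;
}

// Walk the tree with the inherited style; consecutive text with the same
// effective style is merged into a single run.
function collectRuns(DOMNode $node, array $style, array &$runs): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $text = preg_replace('/\s+/', ' ', $child->nodeValue);
            $last = count($runs) - 1;
            if ($last >= 0 && ($runs[$last]['style'] === $style || trim($text) === '')) {
                $runs[$last]['text'] .= $text;          // same style or bare whitespace
            } elseif (trim($text) !== '') {
                $runs[] = ['style' => $style, 'text' => $text];
            }
        } elseif ($child instanceof DOMElement) {
            $childStyle = $style;
            if ($child->nodeName === 'span') {
                // Inner declarations override inherited ones.
                $childStyle = array_merge($style, parseInlineStyle($child->getAttribute('style')));
            } elseif ($child->nodeName === 'strong' || $child->nodeName === 'b') {
                $childStyle['font-weight'] = 'bold';
            }
            ksort($childStyle);                          // stable order for comparison and output
            collectRuns($child, $childStyle, $runs);
        }
    }
}

// Return one <span> per run of equally styled text.
function flattenSpans(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);        // editor output is rarely valid
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $runs = [];
    collectRuns($dom->documentElement, [], $runs);

    $out = '';
    $lastIndex = count($runs) - 1;
    foreach ($runs as $i => $run) {
        $text = preg_replace('/ {2,}/', ' ', $run['text']);
        if ($i === 0) { $text = ltrim($text); }
        if ($i === $lastIndex) { $text = rtrim($text); }
        $text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
        if ($run['style'] === []) { $out .= $text; continue; }
        $decls = [];
        foreach ($run['style'] as $prop => $value) { $decls[] = "$prop:$value"; }
        $out .= '<span style="' . htmlspecialchars(implode(';', $decls), ENT_QUOTES, 'UTF-8')
              . '">' . $text . '</span>';
    }
    return $out;
}

echo flattenSpans('<p><strong><span style="font-size: 14px"><span style="font-size: 16px">'
    . '<span style="color: #006400">This is a </span></span>'
    . '<span style="color: #b22222">Test</span></span></strong></p>'), "\n";
```

The caller is expected to wrap the result in whatever block element the comment uses. Because bold is folded into the style here, the output contains no strong tags; keeping them would mean tracking structure as well as style.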

answered Sep 19 '22 21:09

MMeah
