
Regex to remove all empty HTML tags

Tags: html, regex, php

This is my PHP function to remove all empty HTML tags from a string:

/**
 * Remove the nested HTML empty tags from the string.
 *
 * @param $string String to remove tags
 * @param null $replaceTo Replace empty string with
 * @return mixed Cleaned string
 */
function crl_remove_empty_tags($string, $replaceTo = null)
{
    // Return if string not given or empty
    if (!is_string($string) || trim($string) == '') return $string;

    // Recursive empty HTML tags
    return preg_replace(
        '/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm',
        !is_string($replaceTo) ? '' : $replaceTo,
        $string
    );
}

My regex: /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm

I tested it with http://gskinner.com/RegExr/ and http://regexpal.com/, and it worked well. But when I run it on the server, it always returns this error:

Warning: preg_replace(): Unknown modifier '\'

I have no idea what exactly is wrong with '\'. Can someone please help me out?

asked Jan 10 '14 18:01 by Manhhailua


3 Answers

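In PHP regular expressions, you need to escape the delimiter wherever it occurs literally inside the expression.

In your case, you have two unescaped / characters; simply replace them with \/. PHP treats the first unescaped / as the end of the pattern and reads whatever follows as modifiers, which is why it complains about '\'. You also don't need the list of modifiers: preg_replace() already replaces every match, PHP has no g modifier, and the pattern contains no literal letters for i to act on (i would only make the back-reference \1 case-insensitive). The x, s and m modifiers have no effect on this pattern either.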

Before:

/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm

After:

/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/
//                                                                    ^       ^
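Another way to avoid the problem is to choose a delimiter that does not occur in the pattern at all. The following is only a sketch of that idea: it reuses the pattern from the question unchanged, wraps it in ~ instead of /, and the function name is made up for the example.

```php
<?php
// Hypothetical helper name; the pattern body is the one from the question.
// Using "~" as the delimiter means the two literal "/" need no escaping.
function remove_empty_tags_tilde(string $html): string
{
    $pattern = '~<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>~';

    // preg_replace() returns null on failure; fall back to the input in that case.
    $result = preg_replace($pattern, '', $html);

    return $result === null ? $html : $result;
}

echo remove_empty_tags_tilde('<p>Hello</p><div class="foo"></div><span> </span>'), PHP_EOL;
```

Like the original, this makes a single pass, so an element that only becomes empty after its children are removed is left in place.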
answered Oct 23 '22 10:10 by brandonscript


This pattern removes "empty tags" (i.e. non-self-closing tags that contain nothing but whitespace, HTML comments or other "empty tags"), even when these tags are nested, as in <span><span></span></span>. Tags inside HTML comments are not taken into account:

$pattern = <<<'EOD'
~
<
(?:
    !--[^-]*(?:-(?!->)[^-]*)*-->[^<]*(*SKIP)(*F) # skip comments
  |
    ( # group 1
        (\w++)     # tag name in group 2
        [^"'>]* #'"# all that is not a quote or a closing angle bracket
        (?: # quoted attributes
            "[^\\"]*(?:\\.[^\\"]*)*+" [^"'>]* #'"# double quote
          |
            '[^\\']*(?:\\.[^\\']*)*+' [^"'>]* #'"# single quote
        )*+
        >
        \s*
        (?:
            <!--[^-]*(?:-(?!->)[^-]*)*+--> \s* # html comments
          |
            <(?1) \s*                          # recursion with the group 1
        )*+
        </\2> # closing tag
    ) # end of the group 1
)
~sxi
EOD;

$html = preg_replace($pattern, '', $html);

Limitations:

  • This approach will remove links to external Javascript files:
    <script src="myscript.js"></script>
  • The pattern may remove part of embedded Javascript code if something like:
    var myvar="<span></span>";
    or like:
    var myvar1="<span><!--"; function doSomething() { alert("!!!"); } var myvar2="--></span>";
    is found.

These limitations arise because a plain-text approach cannot distinguish HTML from JavaScript code. You can work around the problem by adding "script" tags to the pattern's skip list (in the same way as HTML comments), but you then need to describe the JavaScript content itself (strings, comments, regex literals, and everything that is none of these three), which is possible but not trivial.
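As a rough illustration of the skip-list idea, here is a much simpler variant than a full description of JavaScript content. It assumes a script body never contains the literal text </script>, it uses a simplified non-recursive empty-tag pattern rather than the pattern above, and the function name is invented for the example.

```php
<?php
// Hypothetical helper: removes empty elements but leaves <script> elements alone.
// Assumption: a script body never contains the literal text "</script>".
function remove_empty_tags_keep_scripts(string $html): string
{
    $pattern = '~
        <script\b[^>]*>.*?</script>(*SKIP)(*F)  # skip whole script elements
      |
        <(\w+)\b[^>]*>\s*</\1>                   # a simple empty element
    ~xsi';

    // Repeat until nothing changes, so parents emptied by a pass are removed too.
    do {
        $previous = $html;
        $result = preg_replace($pattern, '', $html);
        if ($result === null) {
            return $previous; // regex error: keep the last good value
        }
        $html = $result;
    } while ($html !== $previous);

    return $html;
}

echo remove_empty_tags_keep_scripts(
    '<div><script src="a.js"></script><span></span><script>var s="<b></b>";</script></div>'
), PHP_EOL;
```

Both script elements survive, including the string "<b></b>" inside the second one, while the empty span is removed.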

answered Oct 23 '22 10:10 by Casimir et Hippolyte


This removes empty elements, and then any elements that become empty as a result.

For example:

<p>Hello!
   <div class="foo"><p id="nobody">
   </p>
      </div>
 </p>

Result (leftover whitespace omitted):

<p>Hello!</p>

PHP code:

/* $html store the html content */
do {
    $tmp = $html;
    $html = preg_replace( '#<([^ >]+)[^>]*>([[:space:]]|&nbsp;)*</\1>#', '', $html );
} while ( $html !== $tmp );
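To try the loop on the example input, it can be wrapped in a small function. This is only a sketch: the function name is made up, and the null check is an addition for the case where preg_replace() fails.

```php
<?php
// Hypothetical wrapper around the do/while loop shown above.
function strip_empty_elements(string $html): string
{
    do {
        $previous = $html;
        $result = preg_replace('#<([^ >]+)[^>]*>([[:space:]]|&nbsp;)*</\1>#', '', $html);
        if ($result === null) {
            return $previous; // regex error: keep the last good value
        }
        $html = $result;
    } while ($html !== $previous);

    return $html;
}

$input = "<p>Hello!\n   <div class=\"foo\"><p id=\"nobody\">\n   </p>\n      </div>\n </p>";
$clean = strip_empty_elements($input);

// The whitespace that surrounded the removed elements stays in the string,
// so collapse it before printing.
echo preg_replace('/\s+/', ' ', $clean), PHP_EOL;
```

The inner p is removed on the first pass and the div on the second; the third pass changes nothing, which ends the loop.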
answered Oct 23 '22 10:10 by Alejandro Salamanca Mazuelo