Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using PHP substr() and strip_tags() while retaining formatting and without breaking HTML

I have various HTML strings to cut to 100 characters (of the stripped content, not the original) without stripping tags and without breaking HTML.

Original HTML string (288 characters):

$content = "<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the air
<span>everywhere</span>, it's a HTML taggy kind of day.</strong></div>";

Standard trim: Trim to 100 characters and HTML breaks, stripped content comes to ~40 characters:

$content = substr($content, 0, 100)."..."; /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove... */

Stripped HTML: Outputs correct character count but obviously looses formatting:

$content = substr(strip_tags($content)), 0, 100)."..."; /* output:
With a span over here and a nested div over there and a lot of other nested
texts and tags in the ai... */

Partial solution: using HTML Tidy or purifier to close off tags outputs clean HTML but 100 characters of HTML not displayed content.

$content = substr($content, 0, 100)."...";
$tidy = new tidy; $tidy->parseString($content); $tidy->cleanRepair(); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove</div></div>... */

Challenge: To output clean HTML and n characters (excluding character count of HTML elements):

$content = cutHTML($content, 100); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the
ai</strong></div>...";

Similar Questions

  • How to clip HTML fragments without breaking up tags
  • Cutting HTML strings without breaking HTML tags
like image 487
Peter Craig Avatar asked Mar 08 '10 01:03

Peter Craig


4 Answers

Not amazing, but works.

function html_cut($text, $max_length)
{
    $tags   = array();
    $result = "";

    $is_open   = false;
    $grab_open = false;
    $is_close  = false;
    $in_double_quotes = false;
    $in_single_quotes = false;
    $tag = "";

    $i = 0;
    $stripped = 0;

    $stripped_text = strip_tags($text);

    while ($i < strlen($text) && $stripped < strlen($stripped_text) && $stripped < $max_length)
    {
        $symbol  = $text{$i};
        $result .= $symbol;

        switch ($symbol)
        {
           case '<':
                $is_open   = true;
                $grab_open = true;
                break;

           case '"':
               if ($in_double_quotes)
                   $in_double_quotes = false;
               else
                   $in_double_quotes = true;

            break;

            case "'":
              if ($in_single_quotes)
                  $in_single_quotes = false;
              else
                  $in_single_quotes = true;

            break;

            case '/':
                if ($is_open && !$in_double_quotes && !$in_single_quotes)
                {
                    $is_close  = true;
                    $is_open   = false;
                    $grab_open = false;
                }

                break;

            case ' ':
                if ($is_open)
                    $grab_open = false;
                else
                    $stripped++;

                break;

            case '>':
                if ($is_open)
                {
                    $is_open   = false;
                    $grab_open = false;
                    array_push($tags, $tag);
                    $tag = "";
                }
                else if ($is_close)
                {
                    $is_close = false;
                    array_pop($tags);
                    $tag = "";
                }

                break;

            default:
                if ($grab_open || $is_close)
                    $tag .= $symbol;

                if (!$is_open && !$is_close)
                    $stripped++;
        }

        $i++;
    }

    while ($tags)
        $result .= "</".array_pop($tags).">";

    return $result;
}

Usage example:

$content = html_cut($content, 100);
like image 126
code ex machina Avatar answered Oct 10 '22 23:10

code ex machina


I'm not claiming to have invented this, but there is a very complete Text::truncate() method in CakePHP which does what you want:

function truncate($text, $length = 100, $ending = '...', $exact = true, $considerHtml = false) {
    if (is_array($ending)) {
        extract($ending);
    }
    if ($considerHtml) {
        if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        $totalLength = mb_strlen($ending);
        $openTags = array();
        $truncate = '';
        preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
            if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
                if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
                    array_unshift($openTags, $tag[2]);
                } else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
                    $pos = array_search($closeTag[1], $openTags);
                    if ($pos !== false) {
                        array_splice($openTags, $pos, 1);
                    }
                }
            }
            $truncate .= $tag[1];

            $contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
            if ($contentLength + $totalLength > $length) {
                $left = $length - $totalLength;
                $entitiesLength = 0;
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entitiesLength <= $left) {
                            $left--;
                            $entitiesLength += mb_strlen($entity[0]);
                        } else {
                            break;
                        }
                    }
                }

                $truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
                break;
            } else {
                $truncate .= $tag[3];
                $totalLength += $contentLength;
            }
            if ($totalLength >= $length) {
                break;
            }
        }

    } else {
        if (mb_strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = mb_substr($text, 0, $length - strlen($ending));
        }
    }
    if (!$exact) {
        $spacepos = mb_strrpos($truncate, ' ');
        if (isset($spacepos)) {
            if ($considerHtml) {
                $bits = mb_substr($truncate, $spacepos);
                preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
                if (!empty($droppedTags)) {
                    foreach ($droppedTags as $closingTag) {
                        if (!in_array($closingTag[1], $openTags)) {
                            array_unshift($openTags, $closingTag[1]);
                        }
                    }
                }
            }
            $truncate = mb_substr($truncate, 0, $spacepos);
        }
    }

    $truncate .= $ending;

    if ($considerHtml) {
        foreach ($openTags as $tag) {
            $truncate .= '</'.$tag.'>';
        }
    }

    return $truncate;
}
like image 42
deceze Avatar answered Oct 11 '22 00:10

deceze


Use PHP's DOMDocument class to normalize an HTML fragment:

$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');      
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));

This question is similar to an earlier question and I've copied and pasted one solution here. If the HTML is submitted by users you'll also need to filter out potential Javascript attack vectors like onmouseover="do_something_evil()" or <a href="javascript:more_evil();">...</a>. Tools like HTML Purifier were designed to catch and solve these problems and are far more comprehensive than any code that I could post.

like image 35
leepowers Avatar answered Oct 10 '22 23:10

leepowers


I made another function to do it, it supports UTF-8:

/**
 * Limit string without break html tags.
 * Supports UTF8
 * 
 * @param string $value
 * @param int $limit Default 100
 */
function str_limit_html($value, $limit = 100)
{

    if (mb_strwidth($value, 'UTF-8') <= $limit) {
        return $value;
    }

    // Strip text with HTML tags, sum html len tags too.
    // Is there another way to do it?
    do {
        $len          = mb_strwidth($value, 'UTF-8');
        $len_stripped = mb_strwidth(strip_tags($value), 'UTF-8');
        $len_tags     = $len - $len_stripped;

        $value = mb_strimwidth($value, 0, $limit + $len_tags, '', 'UTF-8');
    } while ($len_stripped > $limit);

    // Load as HTML ignoring errors
    $dom = new DOMDocument();
    @$dom->loadHTML('<?xml encoding="utf-8" ?>'.$value, LIBXML_HTML_NODEFDTD);

    // Fix the html errors
    $value = $dom->saveHtml($dom->getElementsByTagName('body')->item(0));

    // Remove body tag
    $value = mb_strimwidth($value, 6, mb_strwidth($value, 'UTF-8') - 13, '', 'UTF-8'); // <body> and </body>
    // Remove empty tags
    return preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $value);
}

SEE DEMO.

I recommend use html_entity_decode at the start of function, so it preserves the UTF-8 characters:

 $value = html_entity_decode($value);
like image 22
SnakeDrak Avatar answered Oct 11 '22 01:10

SnakeDrak