Truncate text containing HTML, ignoring tags

Tags:

html

string

php

markup

I want to truncate some text (loaded from a database or text file), but it contains HTML, so the tags are counted toward the length and less text is returned. This can also leave tags unclosed or partially cut off (so Tidy may not work properly, and there is still less content). How can I truncate based on the text alone (probably stopping at a table, as that could cause more complex issues)?

substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."

Would result in:

Hello, my <strong>name</st...

What I would want is:

Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...

How can I do this?

While my question is for how to do it in PHP, it would be good to know how to do it in C#... either should be OK as I think I would be able to port the method over (unless it is a built in method).

Also note that I have included an HTML entity &acute; - which would have to be considered as a single character (rather than 7 characters as in this example).

strip_tags is a fallback, but I would lose formatting and links and it would still have the problem with HTML entities.

900

asked Jul 28 '09 11:07

SamWM

6 Answers

Assuming you are using valid XHTML, it's simple to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".

<?php
header('Content-type: text/plain; charset=utf-8');

function printTruncated($maxLength, $html, $isUtf8=true)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

    // For UTF-8, we need to count multibyte sequences as one character.
    $re = $isUtf8
        ? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
        : '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';

    while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength)
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);
        if ($printedLength >= $maxLength) break;

        if ($tag[0] == '&' || ord($tag) >= 0x80)
        {
            // Pass the entity or UTF-8 multibyte sequence through unchanged.
            print($tag);
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.

                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.

                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                print($tag);
            }
            else
            {
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}


printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("\n");

printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");

printTruncated(10, "<em><b>Hello</b>&#20;w\xC3\xB8rld!</em>"); print("\n");

Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported; just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.

(You should always be using UTF-8, though.)

Edit: Updated to handle character entities and UTF-8. Fixed bug where the function would print one character too many, if that character was a character entity.

195

answered Oct 07 '22 12:10

Søren Løvborg

I've written a function that truncates HTML just as you suggest, but instead of printing the result it keeps it all in a string variable. It handles HTML entities as well.

/**
 *  function to truncate and then clean up end of the HTML,
 *  truncates by counting characters outside of HTML tags
 *
 *  @author alex lockwood, alex dot lockwood at websightdesign
 *
 *  @param string $str the string to truncate
 *  @param int $len the number of characters
 *  @param string $end the end string for truncation
 *  @return string $truncated_html
 */
public static function truncateHTML($str, $len, $end = '&hellip;'){
    // find all tags
    $tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i';  // match html tags and entities
    preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);
    //WSDDebug::dump($matches); exit;

    $openTags = array();   // stack of tags opened before the cut point
    $warnings = array();
    $i = 0;
    // loop through each found tag that is within the $len, add those characters to the len,
    // also track open and closed tags
    // $matches[$i][0] = the whole tag string  --the only applicable field for html entities
    // IF it's not matching an &htmlentity; the following apply
    // $matches[$i][1] = the start of the tag, either '<' or '</'
    // $matches[$i][2] = the tag name
    // $matches[$i][3] = the end of the tag
    // $matches[$i][$j][0] = the string
    // $matches[$i][$j][1] = the string offset

    // check that the match exists before reading its offset
    while(!empty($matches[$i]) && $matches[$i][0][1] < $len){

        $len = $len + strlen($matches[$i][0][0]);
        if(substr($matches[$i][0][0],0,1) == '&')
            $len = $len-1;

        // if $matches[$i][2] is undefined then it's an html entity, want to ignore those for tag counting
        // ignore empty/singleton tags for tag counting
        if(!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0],array('br','img','hr', 'input', 'param', 'link'))){
            // double check
            if(substr($matches[$i][3][0],-1) !='/' && substr($matches[$i][1][0],-1) !='/')
                $openTags[] = $matches[$i][2][0];
            elseif(end($openTags) == $matches[$i][2][0]){
                array_pop($openTags);
            }else{
                $warnings[] = "html has some tags mismatched in it:  $str";
            }
        }

        $i++;
    }

    $closeTagString = '';

    if (!empty($openTags)){
        $openTags = array_reverse($openTags);
        foreach ($openTags as $t){
            $closeTagString .= "</" . $t . ">";
        }
    }

    if(strlen($str) > $len){
        // find the first space at or after the new length
        $lastWord = strpos($str, ' ', $len);
        if ($lastWord !== false) {
            // truncate at that word boundary
            $str = substr($str, 0, $lastWord);
            // find the last character
            $last_character = (substr($str, -1, 1));
            // add the end text
            $truncated_html = ($last_character == '.' ? $str : ($last_character == ',' ? substr($str, 0, -1) : $str) . $end);
        } else {
            // no space after the cut point: cut at the adjusted length instead
            $truncated_html = substr($str, 0, $len) . $end;
        }
        // restore any open tags
        $truncated_html .= $closeTagString;
    }else
        $truncated_html = $str;

    return $truncated_html;
}

answered Oct 07 '22 13:10

alockwood05

100% accurate, but pretty difficult approach:

1. Iterate over the characters using the DOM.
2. Use DOM methods to remove the remaining elements.
3. Serialize the DOM.
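
A minimal sketch of the DOM route, assuming the dom extension is available and the input is UTF-8. The names truncateHtmlWithDom and truncateNode, the wrapper div and the plain-text $end marker are illustrative choices, not part of the answer. Because the parser decodes entities, an entity such as &acute; arrives as a single character in a text node and is counted once; the serialized output may spell such characters differently from the input.

```php
<?php
// Hypothetical helper: walks $node, keeps $remaining text characters,
// then removes everything that follows the cut point.
function truncateNode(DOMNode $node, int &$remaining, string $end): void
{
    // Copy the child list first, because nodes are removed while iterating.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining < 0) {
            // The cut has already been made: drop all later nodes.
            $node->removeChild($child);
            continue;
        }
        if ($child instanceof DOMText) {
            // Split into UTF-8 characters so multibyte characters count once.
            $chars = preg_split('//u', $child->data, -1, PREG_SPLIT_NO_EMPTY);
            if (count($chars) > $remaining) {
                $child->data = implode('', array_slice($chars, 0, $remaining)) . $end;
                $remaining = -1; // marks "cut made"
            } else {
                $remaining -= count($chars);
            }
        } elseif ($child instanceof DOMElement) {
            truncateNode($child, $remaining, $end);
        }
    }
}

// Hypothetical entry point. $end is inserted as plain text, so an entity
// such as '&hellip;' would be escaped rather than rendered.
function truncateHtmlWithDom(string $html, int $maxLength, string $end = '...'): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    // The XML encoding hint makes libxml read the fragment as UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8"?><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first div in document order.
    $root = $doc->getElementsByTagName('div')->item(0);
    $remaining = $maxLength;
    truncateNode($root, $remaining, $end);

    // Serialize only the children of the wrapper.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Since the parser builds a well-formed tree, closing tags come for free when the tree is serialized; the cost is that the markup is normalized by libxml rather than passed through byte for byte.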

Easy brute-force approach:

1. Split the string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_SPLIT_DELIM_CAPTURE.
2. Measure the text length you want (the text will be every second element from the split; you might use html_entity_decode() to help measure accurately).
3. Cut the string (trim &[^\s;]+$ at the end to get rid of a possibly chopped entity).
4. Fix it with HTML Tidy.
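
The brute-force steps could be sketched roughly as follows. This is an illustration under stated assumptions, not the answer's exact recipe: the function name truncateHtmlBySplit and the list of void tags are made up for the example, the input is assumed to be valid UTF-8, entities are counted with a regular expression instead of html_entity_decode(), and open tags are closed with a small stack instead of HTML Tidy so that no extension is needed.

```php
<?php
// Hypothetical helper illustrating the split-and-count approach.
function truncateHtmlBySplit(string $html, int $maxLength, string $end = '...'): string
{
    // With PREG_SPLIT_DELIM_CAPTURE, even indexes hold text and odd indexes hold tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $void = ['br', 'img', 'hr', 'input', 'meta', 'link']; // tags that never need closing
    $open = [];               // stack of currently open tag names
    $remaining = $maxLength;  // text characters still allowed
    $out = '';

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // A tag: copy it through and update the stack of open tags.
            if (preg_match('/^<\s*(\/?)\s*([a-zA-Z][a-zA-Z0-9]*)/', $part, $m)) {
                $name = strtolower($m[2]);
                if ($m[1] === '/') {
                    if (end($open) === $name) {
                        array_pop($open);
                    }
                } elseif (substr($part, -2) !== '/>' && !in_array($name, $void, true)) {
                    $open[] = $name;
                }
            }
            $out .= $part;
            continue;
        }

        // Text: each unit is either a whole entity or one UTF-8 character.
        preg_match_all('/&#?[a-zA-Z0-9]+;|./su', $part, $found);
        $units = $found[0];
        if (count($units) > $remaining) {
            // Cut between units, so an entity is never chopped in half.
            $out .= implode('', array_slice($units, 0, $remaining)) . $end;
            break;
        }
        $out .= $part;
        $remaining -= count($units);
    }

    // Close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

// The example from the question:
echo truncateHtmlBySplit('Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.', 26), "\n";
// prints: Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...
```

Cutting on whole units makes the entity-trimming step unnecessary here. The split pattern is naive: comments or attribute values that contain ">" would confuse it.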

answered Oct 07 '22 11:10

Kornel

I used a nice function found at http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-tags-words, apparently taken from CakePHP

answered Oct 07 '22 12:10

periklis

The following is a simple state-machine parser which handles your test case successfully. It fails on nested tags, though, as it doesn't track the tags themselves. It also chokes on entities within HTML tags (e.g. in an href attribute of an <a> tag). So it cannot be considered a 100% solution to this problem, but because it's easy to understand it could be the basis for a more advanced function.

function substr_html($string, $length)
{
    $count = 0;
    /*
     * $state = 0 - normal text
     * $state = 1 - in HTML tag
     * $state = 2 - in HTML entity
     */
    $state = 0;    
    for ($i = 0; $i < strlen($string); $i++) {
        $char = $string[$i];
        if ($char == '<') {
            $state = 1;
        } else if ($char == '&') {
            $state = 2;
            $count++;
        } else if ($char == ';') {
            $state = 0;
        } else if ($char == '>') {
            $state = 0;
        } else if ($state === 0) {
            $count++;
        }

        if ($count === $length) {
            return substr($string, 0, $i + 1);
        }
    }
    return $string;
}
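
One possible refinement of the parser above, offered as a sketch rather than a finished solution. The name substr_html_strict is hypothetical. It checks the current state before interpreting a character, so "&" inside a tag and ";" or ">" in normal text are no longer misread, and it counts an entity only once its closing ";" has been seen, so the cut never lands in the middle of an entity. Like the original it works on bytes and does not track or close open tags, and a stray "&" with no ";" after it would swallow the rest of the text.

```php
<?php
// Hypothetical variant of substr_html(); byte-oriented like the original,
// so it assumes a single-byte encoding or ASCII-only text.
function substr_html_strict(string $string, int $length): string
{
    $count = 0;
    $state = 0; // 0 = normal text, 1 = in HTML tag, 2 = in HTML entity
    $total = strlen($string);

    for ($i = 0; $i < $total; $i++) {
        $char = $string[$i];

        if ($state === 1) {
            // Inside a tag nothing is counted; '>' returns to normal text.
            if ($char === '>') {
                $state = 0;
            }
            continue;
        }

        if ($state === 2) {
            // Inside an entity: wait for ';', then count the entity once.
            if ($char !== ';') {
                continue;
            }
            $state = 0;
            $count++;
        } elseif ($char === '<') {
            $state = 1;
            continue;
        } elseif ($char === '&') {
            $state = 2;
            continue;
        } else {
            // Ordinary character, including ';' and '>' in normal text.
            $count++;
        }

        if ($count >= $length) {
            return substr($string, 0, $i + 1);
        }
    }
    return $string;
}
```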

answered Oct 07 '22 12:10

Stefan Gehrig

You can use Tidy as well:

function truncate_html($html, $max_length) {   
  return tidy_repair_string(substr($html, 0, $max_length),
     array('wrap' => 0, 'show-body-only' => TRUE), 'utf8'); 
}

answered Oct 07 '22 13:10

gpilotino
