
Merging two Regular Expressions to Truncate Words in Strings

I'm trying to write the following function, which truncates a string to whole words (if possible; otherwise it should truncate by characters):

function Text_Truncate($string, $limit, $more = '...')
{
    $string = trim(html_entity_decode($string, ENT_QUOTES, 'UTF-8'));

    if (strlen(utf8_decode($string)) > $limit)
    {
        $string = preg_replace('~^(.{1,' . intval($limit) . '})(?:\s.*|$)~su', '$1', $string);

        if (strlen(utf8_decode($string)) > $limit)
        {
            $string = preg_replace('~^(.{' . intval($limit) . '}).*~su', '$1', $string);
        }

        $string .= $more;
    }

    return trim(htmlentities($string, ENT_QUOTES, 'UTF-8', true));
}

Here are some tests:

// Iñtërnâtiônàlizætiøn and then the quick brown fox... (49 + 3 chars)
echo Text_Truncate('Iñtërnâtiônàlizætiøn and then the quick brown fox jumped overly the lazy dog and one day the lazy dog humped the poor fox down until she died.', 50, '...');

// Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_...  (50 + 3 chars)
echo Text_Truncate('Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_jumped_overly_the_lazy_dog and one day the lazy dog humped the poor fox down until she died.', 50, '...');

Both work as is. However, if I drop the second preg_replace(), I get the following:

Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_jumped_overly_the_lazy_dog and one day the lazy dog humped the poor fox down until she died....

I can't use substr() because it works at the byte level, and I don't have access to mb_substr() at the moment. I've made several attempts to merge the second regex into the first one, but without success.

Please help S.M.S., I've been struggling with this for almost an hour.


EDIT: I'm sorry, I've been awake for 40 hours and I shamelessly missed this:

$string = preg_replace('~^(.{1,' . intval($limit) . '})(?:\s.*|$)?~su', '$1', $string);

Still, if someone has a more optimized regex (or one that ignores the trailing space) please share:

"Iñtërnâtiônàlizætiøn and then "
"Iñtërnâtiônàlizætiøn_and_then_"

EDIT 2: I still can't get rid of the trailing whitespace, can someone help me out?

EDIT 3: Okay, none of my edits really worked; I was being fooled by RegexBuddy. I should probably leave this for another day and get some sleep now. Off for today.

Alix Axel, asked Apr 21 '10 12:04

1 Answer

Perhaps I can give you a happy morning after a long night of RegExp nightmares:

'~^(.{1,' . intval($limit) . '}(?<=\S)(?=\s)|.{'.intval($limit).'}).*~su'

Boiling it down:

^         # start of string
(         # begin capture group 1
 .{1,x}   # match 1 to x characters
 (?<=\S)  # lookbehind: the match must end with a non-whitespace char
 (?=\s)   # lookahead: the next char must be whitespace
 |        # otherwise try this:
 .{x}     # take exactly x chars anyway
)         # end capture group
.*        # match the rest of the string (since you were using replace)
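
To see which alternative fires, here is a minimal sketch that applies the pattern to the two test strings from the question (shortened). The helper name truncate_at_word_sketch is made up for illustration, and like the original code it assumes the caller has already checked that the string is longer than the limit.

```php
<?php
// Hypothetical helper (the name is not from the thread): builds the pattern
// above for a given limit and applies it with preg_replace().
function truncate_at_word_sketch($string, $limit)
{
    $limit   = intval($limit);
    $pattern = '~^(.{1,' . $limit . '}(?<=\S)(?=\s)|.{' . $limit . '}).*~su';

    // Assumes $string is already known to be longer than $limit characters.
    return preg_replace($pattern, '$1', $string);
}

// Whitespace exists within the first 50 characters: the first alternative
// matches and the cut lands at the end of a word, with no trailing space.
echo truncate_at_word_sketch('Iñtërnâtiônàlizætiøn and then the quick brown fox jumped overly the lazy dog', 50), "\n";
// Iñtërnâtiônàlizætiøn and then the quick brown fox

// No whitespace in the first 50 characters: the second alternative takes
// exactly 50 characters.
echo truncate_at_word_sketch('Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_jumped_overly_the_lazy_dog and one day', 50), "\n";
// Iñtërnâtiônàlizætiøn_and_then_the_quick_brown_fox_
```

Because of the u modifier, the quantifiers count UTF-8 characters rather than bytes, which is why neither substr() nor mb_substr() is needed here.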

You could always add |$ inside the lookahead, making it (?=\s|$), but since your code already checks that the string is longer than $limit, I didn't feel that case would be necessary.
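
For what it's worth, here is a rough sketch of what the |$ variant would look like. The function name truncate_or_keep_sketch and the "append only if something was cut" check are assumptions for illustration, not part of the original function; entity decoding/encoding is left out, and the input is assumed to be trimmed already, as in the question's code.

```php
<?php
// Hypothetical variant (name and structure are assumptions): the same
// pattern with "|$" added to the lookahead, so a string that already fits
// within $limit is matched whole and comes back unchanged.
function truncate_or_keep_sketch($string, $limit, $more = '...')
{
    $limit   = intval($limit);
    $pattern = '~^(.{1,' . $limit . '}(?<=\S)(?=\s|$)|.{' . $limit . '}).*~su';
    $cut     = preg_replace($pattern, '$1', $string);

    // Append the suffix only when something was actually removed.
    return ($cut === $string) ? $cut : $cut . $more;
}

echo truncate_or_keep_sketch('quick brown fox', 50), "\n"; // quick brown fox
echo truncate_or_keep_sketch('quick brown fox', 12), "\n"; // quick brown...
echo truncate_or_keep_sketch('abcdefghij', 4), "\n";       // abcd...
```

Without the |$, the first call would return "quick brown", since the pattern would backtrack to the last word boundary even though the whole string fits. That is the case the length check in the original function guards against.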

gnarf, answered Sep 27 '22 18:09