Split text in half, but at the nearest sentence

Tags: html, string, php

Example of a $text variable:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Splitting it in half:

$half = strlen($text) / 2;

will get me to the o character in consequat.

How can I find the position of the nearest sentence delimiter (dot) in the middle of the text? In this example it's 7 characters after that o.

The text also contains HTML. I want to ignore the markup when working out the halfway point of the text, and to ignore dots inside HTML attributes and the like.

Alex asked May 08 '12 02:05


2 Answers

Take a look at substr, strip_tags and strpos. strip_tags removes all HTML tags from the string, strpos finds the position of the next dot after the midpoint, and substr cuts the string at that position.

$string = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.';
// Remove the HTML tags so they do not count towards the length.
$string = strip_tags($string);
// Midpoint of the plain text, as an integer offset.
$half = intval(strlen($string)/2);
// Cut just after the first dot at or beyond the midpoint.
echo substr($string, 0, strpos($string, '.', $half)+1);

Make sure a dot actually exists at or after the position in $half. If it does not, strpos returns false, false + 1 evaluates to 1, and substr returns only the first character of the string.

Perhaps something like this?

$pos = strpos($string, '.', $half);
if ($pos !== false) {
    echo substr($string, 0, $pos + 1);
} else {
    // No dot after the midpoint: cut at the midpoint and mark the truncation.
    echo substr($string, 0, $half) . '...';
}
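
The code above only searches forwards from the midpoint, so the cut can land further away than a dot that sits just before it. A minimal sketch of a search in both directions follows. splitAtNearestDot is a hypothetical name that is not part of this answer, and returning the whole text when it contains no dot is an assumption about the desired fallback.

```php
<?php
// Hypothetical helper: cut $text just after the dot closest to its midpoint,
// looking both before and after the midpoint. Works on bytes, like strlen().
function splitAtNearestDot(string $text): string
{
    $len = strlen($text);
    if ($len === 0) {
        return '';
    }
    $half = intdiv($len, 2);

    // First dot at or after the midpoint, or false if there is none.
    $after = strpos($text, '.', $half);
    // Last dot at or before the midpoint, or false if there is none.
    $before = strrpos(substr($text, 0, $half + 1), '.');

    if ($after === false && $before === false) {
        // Assumption: with no dot at all, return the text unchanged.
        return $text;
    }
    if ($after === false) {
        $cut = $before;
    } elseif ($before === false) {
        $cut = $after;
    } else {
        // Pick whichever dot is closer; ties go to the earlier one.
        $cut = ($half - $before) <= ($after - $half) ? $before : $after;
    }

    // Include the dot itself.
    return substr($text, 0, $cut + 1);
}
```

Call it on the output of strip_tags, for example splitAtNearestDot(strip_tags($string)), if the input may contain markup.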

mpratt answered Sep 30 '22 10:09


Assuming a sentence can end with characters other than a period, you could use strcspn:

$s = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.';

// find center (rounded down, cast to an integer offset)
$mid = (int) floor(strlen($s) / 2);
// count the characters from the center up to the next ., ! or ?
$r = strcspn($s, '.!?', $mid);

// remember to include the punctuation character
echo substr($s, 0, $mid + $r + 1);

You may need to tweak it a little, but it should do its job well. For anything more advanced you are heading into NLP (natural language processing) territory, for which PHP libraries are also available:

http://sourceforge.net/projects/nlp/
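
The code above works on plain text. As a rough sketch of one possible tweak for input that contains markup, the same strcspn idea can be applied only to the text pieces between tags while tracking the offset in the original string. htmlSplitOffset is a hypothetical helper, and it assumes simple tags of the form <...> with no '>' inside attribute values; a real HTML parser would be more robust.

```php
<?php
// Hypothetical helper: byte offset in $html just past the first sentence
// delimiter found at or after the midpoint of the visible text.
// Tags are skipped, so dots inside attributes are never matched.
function htmlSplitOffset(string $html, string $delims = '.!?'): int
{
    // Even indexes hold text, odd indexes hold the captured tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Length of the visible text only.
    $textLen = 0;
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $textLen += strlen($part);
        }
    }
    $mid = intdiv($textLen, 2);

    $htmlPos = 0; // offset in the original markup
    $textPos = 0; // offset in the visible text
    foreach ($parts as $i => $part) {
        $len = strlen($part);
        if ($i % 2 === 0) {
            if ($textPos + $len > $mid) {
                // Start at the midpoint, or at 0 once it has been passed.
                $start = max(0, $mid - $textPos);
                $run = strcspn($part, $delims, $start);
                if ($start + $run < $len) {
                    // Delimiter found in this piece; include it in the cut.
                    return $htmlPos + $start + $run + 1;
                }
            }
            $textPos += $len;
        }
        $htmlPos += $len;
    }

    // No delimiter after the midpoint: fall back to the end of the input.
    return strlen($html);
}
```

substr($html, 0, htmlSplitOffset($html)) then gives the first half with its markup intact, although tags opened before the cut may be left unclosed.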

Ja͢ck answered Sep 30 '22 10:09