How do I, from an output, only select the first 10 words?
implode(' ', array_slice(explode(' ', $sentence), 0, 10));
To add support for other word breaks like commas and dashes, preg_match
gives a quick way and doesn't require splitting the string:
function get_words($sentence, $count = 10) {
    // Match up to $count words, each followed by non-word characters or the end of the string.
    preg_match("/(?:\w+(?:\W+|$)){0,$count}/", $sentence, $matches);
    return $matches[0];
}
As Pebbl mentions, PHP's \w and \W don't always handle UTF-8 or Unicode characters well, so if that is a concern you can replace \w with [^\s,\.;\?\!] and \W with [\s,\.;\?\!].
Simply splitting on spaces will function incorrectly if there is an unexpected character in place of a space in the sentence structure, or if the sentence contains multiple conjoined spaces.
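A quick sketch of that failure mode. The sample sentence is made up for illustration: it has a double space and a comma with no space after it.

```php
<?php
// Hypothetical sample: two spaces after "one", no space after the comma.
$sentence = 'one  two,three four';

// explode() cuts at every single space, so the double space produces an
// empty "word" and "two,three" stays glued together as one piece.
$words = explode(' ', $sentence);

// Taking the "first 3 words" therefore counts the empty piece as a word.
$firstThree = implode(' ', array_slice($words, 0, 3));

var_dump($words);
var_dump($firstThree);
```

Here the "first 3 words" come back as `one  two,three`, because the empty piece between the two spaces is counted as a word.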
The following version works regardless of which kind of "space" separates the words, and it can easily be extended to handle other characters. It currently supports any whitespace character plus , . ; ? !
function get_snippet( $str, $wordCount = 10 ) {
  return implode(
    '',
    array_slice(
      preg_split( '/([\s,\.;\?\!]+)/', $str, $wordCount*2+1, PREG_SPLIT_DELIM_CAPTURE ),
      0,
      $wordCount*2-1
    )
  );
}
Regular expressions are perfect for this issue, because you can easily make the code as flexible or strict as you like. You do have to be careful however. I specifically approached the above targeting the gaps between words — rather than the words themselves — because it is rather difficult to state unequivocally what will define a word.
Take the \w word-character class, or its inverse \W. I rarely rely on these, mainly because, depending on the software you are using (like certain versions of PHP), they don't always include UTF-8 or Unicode characters.
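To make that concrete, here is a small sketch comparing three ways of matching a non-ASCII word. The `u` pattern modifier is not part of the answers above. It is brought in here only for comparison, on the assumption that your PHP build has PCRE compiled with Unicode support. The result of the plain \w test depends on the PHP version and locale, so no output is claimed for it.

```php
<?php
$word = 'дроиды';

// No u modifier: the subject is matched byte by byte, and what \w covers
// beyond ASCII depends on the PHP version and locale in use.
$bytewise = preg_match('/^\w+$/', $word);

// With the u modifier, pattern and subject are treated as UTF-8 and \w
// also matches Unicode letters (assumes PCRE was built with Unicode support).
$unicode = preg_match('/^\w+$/u', $word);

// An explicit "anything that is not a separator" class does not depend on
// how \w is defined at all.
$explicit = preg_match('/^[^\s,\.;\?\!]+$/', $word);

var_dump($bytewise, $unicode, $explicit);
```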
In regular expressions it is better to be specific at all times, so that your expressions can handle things like the following, no matter where they are run:
echo get_snippet('Это не те дроиды, которые вы ищете', 5); /// outputs: Это не те дроиды, которые
Avoiding splitting could be worthwhile in terms of performance, however. So you could use Kelly's updated approach but switch \w+ for [^\s,\.;\?\!]+ and \W+ for [\s,\.;\?\!]+. Personally, though, I like the simplicity of the splitting expression used above: it is easier to read and therefore to modify. The stack of PHP functions, however, is a bit ugly :)
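A minimal sketch of that non-splitting variant might look like the following. The name `get_words_explicit` is hypothetical. Anchoring at the start of the string and trimming trailing separators are additions made here so the result lines up with the get_snippet() output, and the `??` operator assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: the same idea as get_words(), but with explicit
// separator classes instead of \w and \W.
function get_words_explicit($sentence, $count = 10)
{
    $sep = '\s,\.;\?\!';           // characters treated as gaps between words
    $count = max(0, (int) $count); // keep the quantifier a plain non-negative integer

    // Up to $count runs of non-separator characters, each followed by a run
    // of separators or by the end of the string.
    $pattern = '/^(?:[^' . $sep . ']+(?:[' . $sep . ']+|$)){0,' . $count . '}/';
    preg_match($pattern, $sentence, $matches);

    // Drop the separators that trail the last matched word.
    return rtrim($matches[0] ?? '', " \t\n\r,.;?!");
}

echo get_words_explicit('Это не те дроиды, которые вы ищете', 5);
```

For this input it returns the same five words as the get_snippet() example: Это не те дроиды, которые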