Split a text into single words

Question

I would like to split a text into single words using PHP. Do you have any idea how to achieve this?

My approach:

function tokenizer($text) {
    $text = trim(strtolower($text));
    $punctuation = '/[^a-z0-9äöüß-]/';
    $result = preg_split($punctuation, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($result); $i++) {
        $result[$i] = trim($result[$i]);
    }
    return $result; // contains the single words
}
$text = 'This is an example text, it contains commas and full-stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
print_r(tokenizer($text));

Is this a good approach? Do you have any idea for improvement?

Thanks in advance!

moinudin · Accepted Answer

Use the class \p{P} which matches any unicode punctuation character, combined with the \s whitespace class.

$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY);

This will split on a group of one or more whitespace characters, but also suck in any surrounding punctuation characters. It also matches punctuation characters at the beginning or end of the string. This discriminates cases such as "don't" and "he said 'ouch!'"

Eugene Yokota · Answer

Tokenize - strtok.

<?php
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$delim = ' 
	,.!?:;';

$tok = strtok($text, $delim);

while ($tok !== false) {
    echo "Word=$tok<br />";
    $tok = strtok($delim);
}
?>

Gumbo · Answer

I would first make the string to lower-case before splitting it up. That would make the i modifier and the array processing afterwards unnecessary. Additionally I would use the \W shorthand for non-word characters and add a + multiplier.

$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$result = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

Edit Use the Unicode character properties instead of \W as marcog suggested. Something like [\p{P}\p{Z}] (punctuation and separator characters) would cover the characters more specific than \W.

farzad · Answer

you can also use PHP strtok() function to fetch string tokens from your large string. you can use it like this:

 $result = array();
 // your original string
 $text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
 // you pass strtok() your string, and a delimiter to specify how tokens are separated. words are seperated by a space.
 $word = strtok($text,' ');
 while ( $word !== false ) {
     $result[] = $word;
     $word = strtok(' ');
 }

see more on php documentation for strtok()

Split a text into single words

Tags:

php

split

caw

4 Answers

moinudin

Eugene Yokota

Gumbo

farzad

Recent Activity

Donate For Us

Split a text into single words

Tags:

php

split

caw

4 Answers

moinudin

Eugene Yokota

Gumbo

farzad

Related questions

Recent Activity

Donate For Us