
How can I split a sentence into words and punctuation marks?

For example, I want to split this sentence:

I am a sentence.

Into an array with 5 parts: I, am, a, sentence, and the final period (.).

I'm currently using preg_split after trying explode, but I can't seem to find something suitable.

This is what I've tried:

$sentence = explode(" ", $sentence);
/*
returns array(4) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(9) "sentence."
}
*/

And also this:

$sentence = preg_split("/[.?!\s]/", $sentence);
/*
returns array(5) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(8) "sentence"
  [4]=>
  string(0) ""
}
*/

How can this be done?

Lucas asked Dec 27 '22 05:12


2 Answers

You can split on word boundaries:

$sentence = preg_split("/(?<=\w)\b\s*/", 'I am a sentence.');

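The regex matches at every position that is preceded by a word character (the lookbehind (?<=\w)) and sits on a word boundary (\b), that is, at the end of each word. Both of those are zero-width assertions, so nothing is removed there; only the optional whitespace that follows (\s*) is consumed by the split.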

Output:

array(5) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(8) "sentence"
  [4]=>
  string(1) "."
}
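
One thing to watch for with this pattern: if the input ends in a word rather than a punctuation mark, the end of the string is itself a word boundary, so preg_split returns a trailing empty string. A minimal sketch of one way around that, assuming you simply want empty pieces dropped, is to pass PREG_SPLIT_NO_EMPTY. The helper name splitOnWordBoundaries is hypothetical and only there for illustration.

```php
<?php
// Hypothetical helper (not part of the original answer): same pattern,
// plus PREG_SPLIT_NO_EMPTY so empty pieces are discarded.
function splitOnWordBoundaries(string $sentence): array
{
    // (?<=\w)  the position must be preceded by a word character
    // \b       and sit on a word boundary, i.e. at the end of that word
    // \s*      any whitespace after the word is consumed by the split
    return preg_split('/(?<=\w)\b\s*/', $sentence, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitOnWordBoundaries('I am a sentence.')); // I, am, a, sentence, .
print_r(splitOnWordBoundaries('I am a sentence'));  // I, am, a, sentence
```

The third argument (-1) means "no limit" and is only there so the flag can be passed as the fourth.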
nickb answered Jan 24 '23 02:01


I was looking for the same solution and landed here. The accepted solution does not handle non-word characters such as apostrophes and accented letters: it splits words at them. Below is the solution that worked for me.

Here is my test sentence:

Claire’s favorite sonata for piano is Mozart’s Sonata no. 15 in C Major.

The accepted answer gave me the following results:

Array
(
    [0] => Claire
    [1] => ’s
    [2] => favorite
    [3] => sonata
    [4] => for
    [5] => piano
    [6] => is
    [7] => Mozart
    [8] => ’s
    [9] => Sonata
    [10] => no
    [11] => . 15
    [12] => in
    [13] => C
    [14] => Major
    [15] => .
)

The solution I came up with follows:

$parts = preg_split("/\s+|\b(?=[!\?\.])(?!\.\s+)/", $sentence);

It gives the following results:

Array
(
    [0] => Claire’s
    [1] => favorite
    [2] => sonata
    [3] => for
    [4] => piano
    [5] => is
    [6] => Mozart’s
    [7] => Sonata
    [8] => no.
    [9] => 15
    [10] => in
    [11] => C
    [12] => Major
    [13] => .
)
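
A caveat worth checking against your own input: the (?!\.\s+) lookahead that keeps "no." in one piece does so for any period followed by whitespace, so a sentence-ending period in the middle of a longer text also stays attached to its word. The sketch below wraps the same pattern, unchanged, in a hypothetical helper (splitKeepingAbbreviations is a name made up for this example) to show both behaviours.

```php
<?php
// Hypothetical helper wrapping the pattern from this answer, unchanged.
function splitKeepingAbbreviations(string $sentence): array
{
    // \s+                  split on runs of whitespace, or
    // \b(?=[!\?\.])        at the end of a word that is followed by ! ? or .
    // (?!\.\s+)            unless that character is a period followed by whitespace
    return preg_split('/\s+|\b(?=[!\?\.])(?!\.\s+)/', $sentence);
}

print_r(splitKeepingAbbreviations('I am a sentence.')); // I, am, a, sentence, .
print_r(splitKeepingAbbreviations('Stop. Go!'));        // Stop., Go, !
```

If the text contains several sentences, you may need an explicit list of abbreviations instead of this lookahead; that is outside what the pattern above tries to do.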
dlporter98 answered Jan 24 '23 02:01