How can I split a text into an array of sentences?
Example text:
Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End
Should output:
0 => Fry me a Beaver.
1 => Fry me a Beaver!
2 => Fry me a Beaver?
3 => Fry me Beaver no. 4?!
4 => Fry me many Beavers...
5 => End
I tried some solutions that I've found on SO through search, but they all fail, especially at the 4th sentence.
/(?<=[!?.])./
/\.|\?|!/
/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])/
/(?<=[.!?]|[.!?][\'"])\s+/ // <- closest one
Since you want to split the text into sentences, why are you trying to match them?
For this case let's use preg_split().
Code:
$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';
$sentences = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);
print_r($sentences);
Output:
Array
(
    [0] => Fry me a Beaver.
    [1] => Fry me a Beaver!
    [2] => Fry me a Beaver?
    [3] => Fry me Beaver no. 4?!
    [4] => Fry me many Beavers...
    [5] => End
)
Explanation:
Well, to put it simply, we are splitting on one or more whitespace characters (\s+), subject to two conditions:
(?<=[.?!]) Positive lookbehind assertion: the whitespace must be preceded by a period, question mark, or exclamation mark.
(?=[a-z]) Positive lookahead assertion: the whitespace must be followed by a letter (the i modifier makes this case-insensitive). This is a workaround for the "no. 4" problem: the space after "no." is followed by a digit, so no split happens there.
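To see what the lookahead actually buys you, here is a minimal sketch that runs the same pattern with and without it. The helper name split_sentences() is made up for illustration and is not part of the code above. The last two calls show where this workaround stops helping: a sentence that starts with a digit is not split off, and an abbreviation followed by a letter still causes a split.

```php
<?php
// Hypothetical helper (name chosen for illustration): wraps the pattern from the answer.
function split_sentences(string $text): array
{
    // Split on whitespace that follows . ? or ! and precedes a letter.
    // PREG_SPLIT_NO_EMPTY drops any empty pieces.
    $parts = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';

// With the lookahead: "no. 4" stays together.
$withLookahead = split_sentences($str);

// Without the lookahead: the space after "no." also becomes a split point.
$withoutLookahead = preg_split('/(?<=[.?!])\s+/', $str);

echo count($withLookahead), "\n";    // 6
echo count($withoutLookahead), "\n"; // 7
echo $withoutLookahead[3], "\n";     // Fry me Beaver no.

// Limitation 1: a sentence starting with a digit is not separated.
$digitStart = split_sentences('I have 2 cats. 3 dogs live next door.');
echo count($digitStart), "\n";       // 1

// Limitation 2: an abbreviation followed by a letter is still split.
$abbreviation = split_sentences('Dr. Who arrived. He left.');
echo count($abbreviation), "\n";     // 3
```

If your text contains abbreviations or sentences that begin with numbers, a regex like this will not be enough on its own, and you would need extra rules for those cases.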