
Split a text into sentences

How can I split a text into an array of sentences?

Example text:

Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End

Should output:

0 => Fry me a Beaver.
1 => Fry me a Beaver!
2 => Fry me a Beaver?
3 => Fry me Beaver no. 4?!
4 => Fry me many Beavers...
5 => End

I tried several solutions that I found by searching SO, but they all fail, especially on the 4th sentence.

/(?<=[!?.])./

/\.|\?|!/

/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])/

/(?<=[.!?]|[.!?][\'"])\s+/    // <- closest one
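
To make the failure concrete, here is a minimal sketch that runs the "closest" pattern against the example text with preg_split(). The variable names are only illustrative; the pattern and the text are the ones given above.

```php
<?php
// The sample text from the question.
$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';

// The "closest" pattern: split on whitespace that follows ., ! or ?
// (optionally followed by a quote character).
$parts = preg_split('/(?<=[.!?]|[.!?][\'"])\s+/', $str);

print_r($parts);
// The fourth sentence comes back as two elements,
// "Fry me Beaver no." and "4?!", so the array has 7 entries
// instead of the expected 6.
```

The split happens because the space after "no." is preceded by a period, and nothing in the pattern looks at what follows the space.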
asked May 04 '13 18:05 by thelolcat



1 Answer

Since you want to "split" the text into sentences, why are you trying to match them?

For this case let's use preg_split().

Code:

$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';
$sentences = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);
print_r($sentences);

Output:

Array
(
    [0] => Fry me a Beaver.
    [1] => Fry me a Beaver!
    [2] => Fry me a Beaver?
    [3] => Fry me Beaver no. 4?!
    [4] => Fry me many Beavers...
    [5] => End
)

Explanation:

Put simply, we split on one or more whitespace characters (\s+), subject to two conditions:

  1. (?<=[.?!]) is a positive lookbehind assertion: the whitespace must be preceded by a period, question mark, or exclamation mark.

  2. (?=[a-z]) is a positive lookahead assertion: the whitespace must be followed by a letter (upper or lower case, because of the i modifier). This is a workaround for the "no. 4" problem, where a digit follows the space.
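
The lookahead is a heuristic, so it is worth seeing where it helps and where it does not. The sketch below wraps the same pattern in a helper; the name split_sentences(), the trim() call, and the PREG_SPLIT_NO_EMPTY flag are assumptions added for illustration, not part of the original code.

```php
<?php
// Hypothetical helper wrapping the pattern from the answer.
function split_sentences(string $text): array
{
    // Split on whitespace preceded by ., ? or ! and followed by a letter.
    // PREG_SPLIT_NO_EMPTY drops empty pieces (an addition, not in the answer).
    return preg_split('/(?<=[.?!])\s+(?=[a-z])/i', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

// "no. 4" stays together because a digit, not a letter, follows the space.
print_r(split_sentences('Fry me Beaver no. 4?! End'));
// elements: "Fry me Beaver no. 4?!", "End"

// An abbreviation followed by a letter is still split.
print_r(split_sentences('Ask Dr. Smith. He knows.'));
// elements: "Ask Dr.", "Smith.", "He knows."

// A sentence that starts with a digit is not split off.
print_r(split_sentences('I have two. 3 is more.'));
// elements: "I have two. 3 is more."
```

If inputs contain abbreviations such as "Dr." or sentences that begin with numbers, the pattern would need further exceptions, or a dedicated sentence tokenizer may be a better fit.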

answered Oct 02 '22 19:10 by HamZa