
Split text into sentences, but skip quoted content

Tags: regex, ruby

I want to split some text into sentences using a regular expression in Ruby. It does not need to be perfectly accurate, so cases such as "Washington D.C." can be ignored.

However, I have one requirement: if a sentence is quoted (with single or double quotes), it should be ignored, i.e. punctuation inside the quotes should not cause a split.

Say I have the following text:

Sentence One. "Wow." said Alice. Sentence Three.

It should be split into three sentences:

Sentence One.
"Wow." said Alice.
Sentence Three.

Currently I have content.scan(/[^\.!\?\n]*[\.!\?\n]/), but I have a problem with quotes.
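To make the problem concrete, here is a minimal sketch of what that pattern does with the sample text. The method name naive_split is only for illustration; the regex is the one quoted above. The period inside the quotes is treated as a sentence end, so the quoted sentence is cut in two.

```ruby
# Hypothetical helper wrapping the pattern from the question.
def naive_split(text)
  # Match a run of non-terminators followed by one terminator (. ! ? or newline).
  text.scan(/[^\.!\?\n]*[\.!\?\n]/)
end

# Prints four pieces instead of three: the quoted "Wow." is split at its period.
p naive_split('Sentence One. "Wow." said Alice. Sentence Three.')
```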

UPDATE:

The current answer can run into performance issues. Try the following:

'Alice stood besides the table. She looked towards the rabbit, "Wait! Stop!", said Alice'.scan(regexp)

It would be nice if someone could figure out how to avoid this. Thanks!

lulalala asked Oct 20 '25 01:10



1 Answer

How about this:

result = subject.scan(
    /(?:      # Either match...
     "[^"]*"  # a quoted sentence
    |         # or
     [^".!?]* # anything except quotes or punctuation.
    )++       # Repeat as needed; avoid backtracking
    [.!?\s]*  # Then match optional punctuation characters and/or whitespace.
    /x)
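As a rough usage sketch (the names SENTENCE_RE and split_sentences are made up here, not part of the answer): because every part of the pattern is optional, scan can also return empty matches, and each match keeps its trailing whitespace. Stripping the matches and dropping the empty ones should give the three sentences from the question. Note that this only handles double quotes, as the pattern does.

```ruby
# Same pattern as in the answer, stored in a constant (hypothetical name).
SENTENCE_RE = /
  (?:
     "[^"]*"     # a double-quoted run, kept in one piece
   |             # or
     [^".!?]*    # anything except quotes or sentence punctuation
  )++            # repeat possessively so the engine does not backtrack here
  [.!?\s]*       # trailing punctuation and whitespace
/x

# Hypothetical helper: split text and tidy up the raw matches.
def split_sentences(text)
  text.scan(SENTENCE_RE)
      .map(&:strip)        # remove the whitespace matched after each sentence
      .reject(&:empty?)    # discard empty matches, such as the one at end of input
end

p split_sentences('Sentence One. "Wow." said Alice. Sentence Three.')
p split_sentences('Alice stood besides the table. She looked towards the rabbit, "Wait! Stop!", said Alice')
```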
Tim Pietzcker answered Oct 22 '25 06:10
