
Split body of text into sentences but keep punctuation?

Tags: regex, ruby

I'm trying to produce a human-readable, wiki-like diff between two bodies of HTML-laden text. I'm using diff-lcs, and the first step is to split the string (an array of characters) into an array of sentences while keeping each sentence's punctuation.

"I am a lion. Hear me roar! Where is my cub? Never mind, found him.".magic_split(/[.?!]/)
# => ["I am a lion.", "Hear me roar!", "Where is my cub?", "Never mind, found him."]

I expected this to do the trick:

"I am a lion. Hear me roar! Where is my cub? Never mind, found him.".gsub(/[.?!]/, '\1|').split('|')

Except gsub doesn't put the matched character (., ? or !) back in. Instead it returns this:

"I am a lion| Hear me roar| Where is my cub| Never mind, found him|"

What's the easiest way to do a non-destructive split, that is, one that keeps the characters it splits on?

Archonic asked Mar 28 '13 16:03


2 Answers

scan should do the trick (add strip to remove the leading space that every match after the first picks up).

s = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."
s.scan(/[^\.!?]+[\.!?]/).map(&:strip) # => ["I am a lion.", "Hear me roar!", "Where is my cub?", "Never mind, found him."]
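
One caveat with scan: text that does not end in a terminator is silently dropped. As a hedged alternative (not part of the original answer), splitting on whitespace that follows a terminator via a lookbehind keeps the punctuation and the trailing fragment. The variable names below are illustrative only.

```ruby
s = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."

# Split on the whitespace that follows ., ? or !
# The lookbehind (?<=...) checks for the terminator without consuming it,
# so the punctuation stays attached to its sentence.
sentences = s.split(/(?<=[.?!])\s+/)
p sentences
# => ["I am a lion.", "Hear me roar!", "Where is my cub?", "Never mind, found him."]

# Difference from scan when the last sentence has no terminator:
tail = "One. Two"
scanned = tail.scan(/[^.!?]+[.!?]/)      # drops "Two"
splitted = tail.split(/(?<=[.?!])\s+/)   # keeps "Two"
p scanned   # => ["One."]
p splitted  # => ["One.", "Two"]
```

Neither approach understands abbreviations or decimals ("e.g.", "3.14"), so both are only a sketch for simple prose.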
Sergio Tulentsev answered Sep 20 '22 22:09


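I think that should be \0. \1 refers to the first capture group, and the pattern has none, so it is replaced with an empty string; \0 refers to the entire match.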

>> string = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."
>> string.gsub(/[.?!]/, '\0|') 
   # "I am a lion.| Hear me roar!| Where is my cub?| Never mind, found him.|"
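
To finish the approach from the question, the marked string still has to be split and trimmed. A minimal sketch, assuming the text never contains a literal | (the variable names are illustrative):

```ruby
string = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."

# \0 in the replacement is the whole match. The replacement must be
# single-quoted; in a double-quoted string it would need to be "\\0|".
with_backref = string.gsub(/[.?!]/, '\0|').split('|').map(&:strip)

# Equivalent: add a capture group so that \1 has something to refer to.
with_group = string.gsub(/([.?!])/, '\1|').split('|').map(&:strip)

p with_backref
# => ["I am a lion.", "Hear me roar!", "Where is my cub?", "Never mind, found him."]
p with_backref == with_group  # => true
```

split drops the empty string after the final |, so no extra cleanup is needed for the trailing delimiter.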
jvnill answered Sep 20 '22 22:09