I am using Ruby's StringScanner to normalize some English text.
require 'strscan'

def normalize text
  s = ''
  ss = StringScanner.new text
  while ! ss.eos? do
    s += ' ' if ss.scan(/\s+/) # multiple whitespace => single space
    s += 'mice' if ss.scan(/\bmouses\b/) # mouses => mice
    s += '' if ss.scan(/\bthe\b/) # remove 'the'
    s += "#$1 #$2" if ss.scan(/(\d)(\w+)/) # should split 3blind => 3 blind
  end
  s
end
normalize("3blind the mouses") #=> should return "3 blind mice"
Instead I am just getting " mice".
StringScanner#scan is not capturing the (\d) and (\w+).
To access a StringScanner capture group (in Ruby 1.9 and above), use StringScanner#[]:
s += "#{ss[1]} #{ss[2]}" if ss.scan(/(\d)(\w+)/) # splits 3blind => 3 blind
In Ruby 2.1 and later, you should also be able to capture by name (see Peter Alfvin's link):
s += "#{ss[:num]} #{ss[:word]}" if ss.scan(/(?<num>\d)(?<word>\w+)/)
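Putting the fix into the whole method, here is a minimal sketch rather than a definitive implementation. It makes a few assumptions that are not in the original question: the branches are chained with if/elsif so only one pattern fires per pass, the 'the' pattern also swallows trailing whitespace so no double space is left behind, and a fallback branch consumes any other word or character so the loop cannot spin forever on unmatched input.

```ruby
require 'strscan'

def normalize(text)
  out = ''.dup
  ss = StringScanner.new(text)
  until ss.eos?
    if ss.scan(/\s+/)
      out << ' '                        # multiple whitespace => single space
    elsif ss.scan(/\bmouses\b/)
      out << 'mice'                     # mouses => mice
    elsif ss.scan(/\bthe\b\s*/)
      # assumption: drop 'the' together with the whitespace after it
    elsif ss.scan(/(\d)(\w+)/)
      out << "#{ss[1]} #{ss[2]}"        # captures come from ss[n], not $n
    else
      # assumption: pass through any other whole word or single character
      out << ss.scan(/\w+|\W/)
    end
  end
  out
end

normalize("3blind the mouses") #=> "3 blind mice"
```

The fallback consumes a whole word at a time on purpose: the scanner treats the current position as the start of the string for \b, so stepping one character at a time could let \bthe\b match inside a longer word.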
Note: The first version of this/my answer was completely off base, per the comment thread. Apologies.
Based on experimentation and review of http://ruby-doc.org/stdlib-1.9.2/libdoc/strscan/rdoc/StringScanner.html, it appears that StringScanner does not set the match variables $1, $2, etc., so that last s += ... statement is only appending a blank to s.
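The point about the match variables can be checked with a small experiment. This is only a sketch: capture_sources is a hypothetical helper name, and wrapping the scan in a method is my choice so that no earlier regex match in the same scope can leave a stale value in $1.

```ruby
require 'strscan'

# Hypothetical helper: reports what $1 holds after a scan, next to
# what the scanner itself exposes through StringScanner#[].
def capture_sources(text)
  ss = StringScanner.new(text)
  ss.scan(/(\d)(\w+)/)
  { global: $1, scanner: [ss[1], ss[2]] }
end

result = capture_sources('3blind')
result[:global]  #=> nil, so "#$1 #$2" interpolates to a single space
result[:scanner] #=> ["3", "blind"]
```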
Looking at strscan.c, it appears that indeed there is no support for providing captured match information, but I did find https://www.ruby-forum.com/topic/4413436, which appears to be an in-progress effort of some sort to implement this.