Identifying whitespace vs other character runs in a string




Given the strings:

strs = [
  "foo",
  "    ",
  "Hello \n there",
  " Ooh, leading and trailing space!  ",
]

I want a simple method that identifies all contiguous runs of whitespace and non-whitespace characters, in order, and reports whether each run is whitespace (k:0) or not (k:1):

strs.each{ |str| p find_whitespace_runs(str) }
#=> [ {k:1, s:"foo"} ]
#=> [ {k:0, s:"    "} ]
#=> [ {k:1, s:"Hello"}, {k:0, s:" \n "}, {k:1, s:"there"} ]
#=> [
#=>   {k:0, s:" "},
#=>   {k:1, s:"Ooh,"},
#=>   {k:0, s:" "},
#=>   {k:1, s:"leading"},
#=>   {k:0, s:" "},
#=>   {k:1, s:"and"},
#=>   {k:0, s:" "},
#=>   {k:1, s:"trailing"},
#=>   {k:0, s:" "},
#=>   {k:1, s:"space!"},
#=>   {k:0, s:"  "},
#=> ]

This almost works, but includes a single leading {k:0, s:""} group whenever the string does not start with whitespace:

def find_whitespace_runs(str)
  # The capture group keeps the \S+ runs in the result, so odd indices are non-whitespace.
  str.split(/(\S+)/).map.with_index do |s,i|
    {k:i%2, s:s}
  end
end
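
The stray group comes from how String#split handles a capturing pattern: the captured text is kept in the result, and when the string begins with a match, the field before it is an empty string. A minimal sketch of one possible workaround, assuming it is acceptable to assign the parity first and discard empty runs afterwards (the name find_whitespace_runs_via_split is hypothetical, chosen only for this illustration):

```ruby
# Hypothetical variant of the split-based attempt above.
def find_whitespace_runs_via_split(str)
  str.split(/(\S+)/)
     .each_with_index
     .map { |s, i| { k: i % 2, s: s } }  # odd indices hold the captured \S+ runs
     .reject { |run| run[:s].empty? }    # drop the empty field before a leading match
end

"foo bar".split(/(\S+)/)                #=> ["", "foo", " ", "bar"]
find_whitespace_runs_via_split("foo")   #=> [{k:1, s:"foo"}]
find_whitespace_runs_via_split(" foo")  #=> [{k:0, s:" "}, {k:1, s:"foo"}]
```

Filtering after the index is assigned matters here: removing the empty string first would shift every index by one and flip the k values.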

Real-world motivation: writing a syntax highlighter that distinguishes whitespace from non-whitespace in otherwise-unlexed code.

1 Answer

def find_whitespace_runs(str)
  # Group 1 is the whole run; group 3 (nws) is set only for non-whitespace runs.
  str.scan(/((\s+)|(\S+))/).map { |full, ws, nws|
    { :k => nws ? 1 : 0, :s => full }
  }
end
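
The same scan-based idea can also be written without capture groups, classifying each run by its first character. This is only a sketch under that assumption, not a replacement for the answer above; the name find_whitespace_runs_by_first_char is hypothetical:

```ruby
# Hypothetical alternative: scan for runs, then classify each by its first character.
def find_whitespace_runs_by_first_char(str)
  str.scan(/\s+|\S+/).map do |run|
    # A run is homogeneous, so testing its first character is enough.
    { k: run.match?(/\A\s/) ? 0 : 1, s: run }
  end
end

find_whitespace_runs_by_first_char("Hello \n there")
#=> [{k:1, s:"Hello"}, {k:0, s:" \n "}, {k:1, s:"there"}]
```

Because \s and \S together cover every character, the runs returned by scan concatenate back to the original string, and a string that starts with non-whitespace produces no empty leading group.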
