I'm currently struggling to come up with a regex that can split up a string into words where words are defined as a sequence of characters surrounded by whitespace, or enclosed between double quotes. I'm using String#scan
For instance, the string:
' hello "my name" is "Tom"'
should match the words:
hello
my name
is
Tom
I managed to match the words enclosed in double quotes by using:
/"([^\"]*)"/
but I can't figure out how to incorporate the surrounded by whitespace characters to get 'hello', 'is', and 'Tom' while at the same time not screw up 'my name'.
Any help with this would be appreciated!
result = ' hello "my name" is "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)
will work for you. It will print
=> ["", "hello", "\"my name\"", "is", "\"Tom\""]
Just ignore the empty strings.
Explanation
"
\\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
(?: # Match the regular expression below
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\" # Match the character “\"” literally
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\" # Match the character “\"” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\$ # Assert position at the end of a line (at the end of the string or before a line break character)
)
"
You can use reject
like this to avoid empty strings
result = ' hello "my name" is "Tom"'
.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}
prints
=> ["hello", "\"my name\"", "is", "\"Tom\""]
text = ' hello "my name" is "Tom"'
text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}
Produces:
hello
my name
is
Tom
Explanation:
0 or more spaces followed by
either
some words within double-quotes OR
a single word
followed by 0 or more spaces
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With