
Ruby: how to split a string on a regex while keeping the delimiters? [duplicate]

This has been asked multiple times around here, but never got a generic answer, so here we go:

Say you have a string, any string, but let's go with "oruh43451rohcs56oweuex59869rsr", and you want to split it with a regular expression. Any regular expression, but let's go with a sequence of digits: /\d+/. Then you'd use split:

"oruh43451rohcs56oweuex59869rsr".split(/\d+/)
# => ["oruh", "rohcs", "oweuex", "rsr"]

That's lovely and all, but I want the digits. So for that we have scan:

"oruh43451rohcs56oweuex59869rsr".scan(/\d+/)
# => ["43451", "56", "59869"]

But I want it all! Is there, say, a split_and_scan? Nope.

How about I split and scan then zip them? Let me stop you right there.

Ok, so how?

kch asked Mar 07 '26 14:03


2 Answers

If split's pattern contains a capture group, the text matched by that group is included in the resulting array.

str = "oruh43451rohcs56oweuex59869rsr"
str.split(/(\d+)/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]

If you want each piece paired with the delimiter that follows it:

str.split(/(\d+)/).each_slice(2).to_a
# => [["oruh", "43451"], ["rohcs", "56"], ["oweuex", "59869"], ["rsr"]]
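
One caveat with this approach: if the string starts with a delimiter, split returns a leading empty string (for example, "1w" yields ["", "1", "w"]). Here is a minimal sketch of a wrapper that drops those empty pieces. The name split_keeping_delimiters is made up for illustration, and it assumes the pattern has no capture groups of its own, since each extra group would add extra elements to the result.

```ruby
# Hypothetical helper, not part of Ruby's standard library.
# Assumes rx contains no capture groups of its own.
def split_keeping_delimiters(str, rx)
  # Interpolating a Regexp embeds it as a non-capturing (?-mix:...) group,
  # so the outer parentheses are the only capture group here.
  str.split(/(#{rx})/).reject(&:empty?)
end

split_keeping_delimiters("oruh43451rohcs56oweuex59869rsr", /\d+/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]
split_keeping_delimiters("1w1", /\d+/)
# => ["1", "w", "1"]
```

Dropping empty strings also removes the empty piece that split would otherwise report between two adjacent delimiters, which may or may not be what you want.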

Amadan answered Mar 10 '26 04:03


I'm glad you asked… well, there's String#shatter from Facets. I don't love it because it's implemented using trickery (look at the source, it's cute clever trickery, but what if your string actually contains a "\1"?).

So I rolled my own. Here's what you get:

"oruh43451rohcs56oweuex59869rsr".unjoin(/\d+/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]

And here's the implementation:

class Object
  def unfold(&f)
    (m, n = f[self]).nil? ? [] : n.unfold(&f).unshift(m)
  end
end

class String
  def unjoin(rx)
    unfold do |s|
      next if s.empty?
      ix = s =~ rx
      case
      when ix.nil?; [s , ""]
      when ix == 0; [$&, $']
      when ix >  0; [$`, $& + $']
      end
    end
  end
end

(A more verbose version is at the bottom.)

And here are some examples of corner cases being handled:

"".unjoin(/\d+/)     # => []
"w".unjoin(/\d+/)    # => ["w"]
"1".unjoin(/\d+/)    # => ["1"]
"w1".unjoin(/\d+/)   # => ["w", "1"]
"1w".unjoin(/\d+/)   # => ["1", "w"]
"1w1".unjoin(/\d+/)  # => ["1", "w", "1"]
"w1w".unjoin(/\d+/)  # => ["w", "1", "w"]

And that's it, but here's more…

Or, if you don't like mucking with the built-in classes… well, you could use Refinements… but if you really don't like it, here it is as functions:

def unfold(x, &f)
  (m, n = f[x]).nil? ? [] : unfold(n, &f).unshift(m)
end

def unjoin(s, rx)
  unfold(s) do |rest|  # block parameter renamed so it does not shadow the method argument
    next if rest.empty?
    ix = rest =~ rx
    case
    when ix.nil?; [rest, ""]
    when ix == 0; [$&, $']
    when ix >  0; [$`, $& + $']
    end
  end
end
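
For the Refinements route, a minimal sketch might look like the following. The module names StringUnjoin and Demo are made up for illustration. It uses a plain loop instead of the recursive unfold, and it treats a zero-width match as "no match" so the loop always terminates; both are assumptions of this sketch, not part of the implementation above.

```ruby
# Hypothetical module name; the refinement only applies where `using` is called.
module StringUnjoin
  refine String do
    def unjoin(rx)
      parts = []
      rest = self
      until rest.empty?
        m = rest.match(rx)
        # No match, or a zero-width match (which would loop forever):
        # keep the remainder as the final piece.
        if m.nil? || m[0].empty?
          parts << rest
          break
        end
        parts << m.pre_match unless m.pre_match.empty?
        parts << m[0]
        rest = m.post_match
      end
      parts
    end
  end
end

module Demo
  using StringUnjoin  # activates the refinement for the rest of this module body

  def self.pieces(str, rx)
    str.unjoin(rx)
  end
end

Demo.pieces("oruh43451rohcs56oweuex59869rsr", /\d+/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]
```

Outside a scope that calls `using StringUnjoin`, String stays untouched, which is the point of going this way.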

It also occurs to me that it may not always be clear which are the separators and which are the separated bits, so here's a little addition that lets you query a string with #joint? to know what role it played before the split:

class String

  def joint?
    false
  end

  class Joint < String
    def joint?
      true
    end
  end

  def unjoin(rx)
    unfold do |s|
      next if s.empty?
      ix = s =~ rx
      case
      when ix.nil?; [s, ""]
      when ix == 0; [Joint.new($&), $']
      when ix >  0; [$`, $& + $']
      end
    end
  end
end

and here it is in use:

"oruh43451rohcs56oweuex59869rsr".unjoin(/\d+/)\
  .map { |s| s.joint? ? "(#{s})" : s }.join(" ")
# => "oruh (43451) rohcs (56) oweuex (59869) rsr"
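
If subclassing String feels heavy, one alternative is to tag each piece instead. This is only a sketch: unjoin_tagged and the :text / :joint tags are names made up here, and it uses String#partition in a loop rather than the recursive unfold above.

```ruby
# Hypothetical helper; the name and the :text / :joint tags are illustrative.
def unjoin_tagged(str, rx)
  pieces = []
  rest = str
  until rest.empty?
    before, sep, after = rest.partition(rx)
    # An empty separator means no match (or a zero-width match):
    # keep whatever is left as a plain text piece and stop.
    if sep.empty?
      pieces << [:text, rest]
      break
    end
    pieces << [:text, before] unless before.empty?
    pieces << [:joint, sep]
    rest = after
  end
  pieces
end

unjoin_tagged("oruh43451rohcs56", /\d+/)
# => [[:text, "oruh"], [:joint, "43451"], [:text, "rohcs"], [:joint, "56"]]

unjoin_tagged("oruh43451rohcs56", /\d+/)
  .map { |kind, s| kind == :joint ? "(#{s})" : s }.join(" ")
# => "oruh (43451) rohcs (56)"
```

The pieces stay ordinary strings this way, at the cost of carrying the tag alongside each one.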

You can now easily reimplement split and scan:

class String

  def split2(rx)
    unjoin(rx).reject(&:joint?)
  end

  def scan2(rx)
    unjoin(rx).select(&:joint?)
  end

end

"oruh43451rohcs56oweuex59869rsr".split2(/\d+/)
# => ["oruh", "rohcs", "oweuex", "rsr"]

"oruh43451rohcs56oweuex59869rsr".scan2(/\d+/)
# => ["43451", "56", "59869"]

And if you hate match globals and general brevity…

class Object
  def unfold(&map_and_next)
    result = map_and_next.call(self)
    return [] if result.nil?
    mapped_value, next_value = result
    [mapped_value] + next_value.unfold(&map_and_next)
  end
end

class String
  def unjoin(regex)
    unfold do |tail_string|
      next if tail_string.empty?
      match = tail_string.match(regex)
      index = match && match.begin(0)  # nil when the regex does not match
      case
      when index.nil?; [tail_string, ""]
      when index == 0; [match.to_s, match.post_match]
      when index >  0; [match.pre_match, match.to_s + match.post_match]
      end
    end
  end
end

kch answered Mar 10 '26 04:03


