"d̪".chars.to_a
gives me
["d"," ̪"]
How do I get Ruby to split it by graphemes?
["d̪"]
Edit: As @michau's answer notes, Ruby 2.5 introduced the grapheme_clusters method, as well as each_grapheme_cluster if you just want to iterate/enumerate without necessarily creating an array.
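For Ruby 2.5+, a minimal sketch of those two methods. The string is written with a `\u` escape so the combining mark (assumed here to be U+032A, COMBINING BRIDGE BELOW) stays visible; the variable names are only illustrative.

```ruby
# "d" followed by a combining mark: two code points, one grapheme.
s = "d\u032A"

s.chars.length              # => 2 (split by code point)

# Ruby 2.5+: build an array of grapheme clusters.
clusters = s.grapheme_clusters
clusters.length             # => 1

# Ruby 2.5+: iterate without building the array first.
count = 0
s.each_grapheme_cluster { |g| count += 1 }
```

Called without a block, each_grapheme_cluster returns an Enumerator, so it can be chained with methods such as first or with_index.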
In Ruby 2.0 or above you can use str.scan(/\X/), where \X matches a single grapheme cluster:
> "d̪".scan /\X/
=> ["d̪"]
> "d̪d̪d̪".scan /\X/
=> ["d̪", "d̪", "d̪"]
# Let's get crazy:
> str = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'
> str.length
=> 75
> str.scan(/\X/).length
=> 6
If you want to match the grapheme boundaries for any reason, you can use (?=\X) in your regex, for instance:
> "d̪".split /(?=\X)/
=> ["d̪"]
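ActiveSupport (which is included in Rails) also has a way if you can't use \X for some reason. Note that unpack_graphemes was deprecated in Rails 6.0 and later removed, so this only applies to older versions: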
ActiveSupport::Multibyte::Unicode.unpack_graphemes("d̪").map { |codes| codes.pack("U*") }
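If you need the per-grapheme code point arrays without ActiveSupport, something along these lines should give the same shape of result in plain Ruby. This is a sketch rather than a drop-in replacement, and the variable names are made up for illustration.

```ruby
# "d" followed by U+032A (assumed to be the combining mark from the question).
s = "d\u032A"

# One array of code points per grapheme cluster.
codes_per_grapheme = s.scan(/\X/).map(&:codepoints)
# => [[100, 810]]

# Pack each array back into a string, as the ActiveSupport one-liner does.
graphemes = codes_per_grapheme.map { |codes| codes.pack("U*") }
```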